
NumPy vs Pandas

 


 

Well actually, Pandas can't exist without NumPy.  But apparently it has a 'friendlier' face than its parental unit.

This demonstrates a difference.

 

We will begin with an array derived from a table from within ArcGIS Pro. 

 

I used TableToNumPyArray:

 

TableToNumPyArray—Data Access module | ArcGIS Desktop 

 

A classy function.  
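
A minimal sketch of the call that produced the array below.  The table path is hypothetical; the field list matches the dtype shown further down.

import arcpy

tbl = r"C:\Your.gdb\facilities"   # hypothetical table
flds = ['OBJECTID', 'f0', 'County', 'Town', 'Facility', 'Time']
a = arcpy.da.TableToNumPyArray(tbl, flds)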

The array

a # ---- the array from the table ----

array([( 1, 0, 'B', 'A_', 'Hall', 26), ( 2, 1, 'C', 'C_', 'Hall', 60),
( 3, 2, 'D', 'A_', 'Hall', 42), ( 4, 3, 'C', 'A_', 'Hall', 57),
( 5, 4, 'C', 'B_', 'Hall', 51), ( 6, 5, 'B', 'B_', 'Hosp', 14),
( 7, 6, 'C', 'A_', 'Hall', 45), ( 8, 7, 'B', 'B_', 'Hosp', 51),
( 9, 8, 'B', 'A_', 'Hall', 28), (10, 9, 'C', 'C_', 'Hosp', 58),
(11, 10, 'B', 'B_', 'Hosp', 6), (12, 11, 'C', 'A_', 'Hall', 49),
(13, 12, 'B', 'A_', 'Hosp', 42), (14, 13, 'C', 'A_', 'Hosp', 60),
(15, 14, 'C', 'B_', 'Hosp', 41), (16, 15, 'A', 'A_', 'Hosp', 53),
(17, 16, 'A', 'A_', 'Hall', 42), (18, 17, 'A', 'C_', 'Hall', 59),
(19, 18, 'C', 'C_', 'Hosp', 37), (20, 19, 'B', 'B_', 'Hall', 52)],
dtype=[('OBJECTID', '<i4'), ('f0', '<i4'), ('County', '<U2'), ('Town', '<U6'), ('Facility', '<U8'), ('Time', '<i4')])

 

OOOO screams the crowd in disgust... it is confusing to look at.

 

So we will revert to everyone's favorite, Pandas.

import pandas as pd

pd.DataFrame(a)

OBJECTID f0 County Town Facility Time
0 1 0 B A_ Hall 26
1 2 1 C C_ Hall 60
2 3 2 D A_ Hall 42
3 4 3 C A_ Hall 57
4 5 4 C B_ Hall 51
5 6 5 B B_ Hosp 14
6 7 6 C A_ Hall 45
7 8 7 B B_ Hosp 51
8 9 8 B A_ Hall 28
9 10 9 C C_ Hosp 58
10 11 10 B B_ Hosp 6
11 12 11 C A_ Hall 49
12 13 12 B A_ Hosp 42
13 14 13 C A_ Hosp 60
14 15 14 C B_ Hosp 41
15 16 15 A A_ Hosp 53
16 17 16 A A_ Hall 42
17 18 17 A C_ Hall 59
18 19 18 C C_ Hosp 37
19 20 19 B B_ Hall 52

 

AHHHHHH the crowd cheers... much better looking.

 

But wait! If you just wanted the array to look pretty, you can do that easily with NumPy and plain Python as well.

How about...

id OBJECTID f0 County Town Facility Time
-------------------------------------------------
000 1 0 B B_ Hall 11
001 2 1 A A_ Hall 24
002 3 2 C C_ Hosp 43
003 4 3 A B_ Hall 43
004 5 4 B B_ Hall 16
005 6 5 B A_ Hall 8
006 7 6 A C_ Hall 26
007 8 7 B C_ Hall 31
008 9 8 C C_ Hall 7
009 10 9 A A_ Hall 58
010 11 10 A A_ Hosp 20
011 12 11 C A_ Hosp 37
012 13 12 C B_ Hall 36
013 14 13 A B_ Hosp 33
014 15 14 C C_ Hosp 51
015 16 15 B C_ Hosp 53
016 17 16 C A_ Hosp 21
017 18 17 C C_ Hosp 42
018 19 18 A B_ Hosp 43
019 20 19 A C_ Hall 5

 

??????? now the crowd is confused.  Which is better? Which looks better?

The choice is yours.  Maybe I will reveal pd_ at some point, but I might rename it to dp_ in homage to its coder.

 

# ---- quick prints and formats

Too much?  How about a quick_prn of the array, with edge items and width specification, all done in a couple of lines of code.

quick_prn(a, max_lines=10)

Array fields:
('OBJECTID', 'f0', 'County', 'Town', 'Facility', 'Time')
[( 1, 0, 'B', 'A_', 'Hall', 26)
( 2, 1, 'C', 'C_', 'Hall', 60)
( 3, 2, 'D', 'A_', 'Hall', 42)
...
(18, 17, 'A', 'C_', 'Hall', 59)
(19, 18, 'C', 'C_', 'Hosp', 37)
(20, 19, 'B', 'B_', 'Hall', 52)]

 

A few lines of code... a def for multiple purposes

import numpy as np

def quick_prn(a, edgeitems=3, max_lines=25, wdth=100, decimals=2, prn=True):
    """Format a structured array by reshaping and replacing characters from
    the string representation
    """
    wdth = min(len(str(a[0])), wdth)
    with np.printoptions(precision=decimals, edgeitems=edgeitems,
                         threshold=max_lines, linewidth=wdth):
        print("\nArray fields:\n{}\n{}".format(a.dtype.names, a))

 

---- Analysis? ----

What about pivot tables? Excel has them, so does Pandas.  A quick call to some small python/numpy defs and...
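
A minimal sketch of one way to do it, assuming the DataFrame built from the array above; the choice of values, index, columns and aggregation is purely for illustration.

import pandas as pd

df = pd.DataFrame(a)
piv = pd.pivot_table(df, values='Time', index='Town',
                     columns='Facility', aggfunc='mean')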

 

Can't do the blue bit though.

 

---- Comment ----

 

Of course this blog is purely in jest, but it serves to point out that not all is as it seems and that everything has a root.  Remember your origins, even when coding.

Conda vs clones

 

I am beginning to wonder if the reluctance of people to install packages using conda has to do with the interface.

The Package Manager in ArcGIS Pro pre-2.2 was a nice addition, although a poorer incarnation than the Anaconda package manager... but kudos for putting one in the software for those who like to keep everything together.  Esri pulled the rug out in Pro 2.2 when they stopped the Package Manager from doing anything useful.

 

Side note

If you don't want to use conda through the recommended channels and interface.... how about using it in Spyder's IPython Console?
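
A minimal sketch; shell commands can be run straight from the IPython console with a leading '!', assuming conda is on the active environment's PATH (the package name is arbitrary).

!conda list                # see what is installed
!conda install scipy -y    # install into the active environment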

 

If not.... carry on...

-------------------------------------------------------------------------------------------------------------------------------------------------------------

---- (1) ----

So rather than explain how to install packages using conda... which I have done elsewhere... I am going to focus on the proenv.bat file... aka this

 

---- (2) ----

Which is really a shortcut to this!  (sneaky sneaky!)

 

---- (3) ----

When you get there, this is what you see... dark... bleak, uninviting and for some, just plain scary!

 

---- (4) ----

Right-click on the window's title bar to bring up the context menu, and select Properties.

 

---- (5) ----

Now away you go... start with the cursor size

 

 

---- (6) ----

Don't forget the font!

 

---- (7) ----

Fussy about where and how it appears on opening???

 

---- (8) ----

How about those colors!!! Less intimidating???

 

Now don't waste too much time... Dr Google has RGB listings for your favorite colors.  FYI, just don't set the background and text to the same color... but if someone did, you would know how to fix it and carry on doing conda installs.

 

 

---- (9) ----

Too lazy to close windows... adjust the window's opacity so you can see what is behind it.

 

 

---- (10) ----

The final wrap... ready for you

 

And don't forget to install your packages.

 

Have fun!

Scipy ndimage morphology … I am betting that is the first thought that popped into your head.  A new set of tools for the raster arsenal.

But wait!  Their terminology is different from that used by ArcGIS Pro or even ArcMap.

What tools are there?  Lots of filtering tools, but some of the interesting ones are below:

from scipy import ndimage as nd

dir(nd)

[... snip ...
 'affine_transform', ..., 'center_of_mass', 'convolve', 'convolve1d', 'correlate',
 'correlate1d', ..., 'distance_transform_edt', ..., 'filters', ..., 'generic_filter',
 ..., 'geometric_transform', ..., 'histogram', 'imread', 'interpolation', ..., 'label',
 'labeled_comprehension', ..., 'measurements', ..., 'morphology', ..., 'rotate', 'shift',
 'sobel', ...]

 

I previously covered distance_transform_edt which performs the equivalent of Euclidean Distance and Euclidean Allocation in this post

Euclidean distance, allocation and other stuff...  

 

Let's begin with a raster constructed from 3x3 windows of a repeating pattern, with the pattern itself repeated.

You can construct your own rasters using numpy, then apply NumPyArrayToRaster to the result.  I have covered this before, but here is the code to produce the raster.

import numpy as np

def _make_arr_():
    """Make an array with repeating patterns
    """
    a = np.arange(1, 10).reshape(3, 3)                   # 3x3 base pattern, values 1 to 9
    aa = np.repeat(np.repeat(a, 3, axis=1), 3, axis=0)   # grow each value into a 3x3 block
    aa = np.tile(aa, 3)                                  # repeat the pattern across
    aa = np.vstack((aa, aa))                             # and stack a second row of patterns
    return aa
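
To turn the array into a raster, a sketch using NumPyArrayToRaster; the lower-left corner, cell size and output path are made-up values.

import arcpy

aa = _make_arr_()
ras = arcpy.NumPyArrayToRaster(aa, arcpy.Point(300000, 5025000), 1, 1)
ras.save(r"C:\Temp\repeat_pattern.tif")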

 

---- The raster (image) ----

 

---- (1) Expand ----

Expand the zone 1 values by 2 distance units

 

---- (2) Shrink ----

Shrink the zone 1 values by 1 distance unit.

 

---- (3) Regiongroup ----

Produce unique regions from the zones.  Observe that the zone 1 values in the original are given unique values first.  Since there were 6 zone 1s, they are numbered from 1 to 6.  Zone 2 gets numbered from 7 to 12, and that pattern of renumbering repeats.

 

---- (4) Aggregate ----

Aggregation of the input array using the mode produces a spatial aggregation that retains the original values.  Each original 3x3 block is reduced to a single cell, whose width and height are 3 times those of the original cells (9x the area).

---- (5) Some functions to perform the above ----

 

In these examples, buff_dist refers to cells.

For the aggregation, there are a number of options, as shown in the header.

Generally, integer data should be restricted to the mode, min, max, range and central (value), since the median and mean upcast to floating point values.  This can of course be accommodated by using the median_low and median_high functions from the Python statistics package, linked below, with a quick illustration after the link:

 

statistics — Mathematical statistics functions — Python 3.7.1rc1 documentation 
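
A quick illustration of the difference; the sample values are arbitrary.

import statistics

vals = [1, 2, 3, 9]
statistics.median(vals)       # 2.5 ... a float, not ideal for integer rasters
statistics.median_low(vals)   # 2   ... always a member of the data
statistics.median_high(vals)  # 3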

 

So think of a function that you want.  Filtering is a snap since you can 'stride' an array using any kernel you want, with plain numpy or the built-in functions from scipy, as in the sketch below.
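
For instance, a 3x3 mean filter over the demo array using scipy's generic_filter... a sketch, not code from my toolbox.

from scipy import ndimage as nd
import numpy as np

aa = _make_arr_()
smoothed = nd.generic_filter(aa.astype('float'), np.nanmean, size=3)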

 

Enough for now...

 

def expand_(a, val=1, mask_vals=0, buff_dist=1):
    """Expand/buffer a raster by a distance
    """

    if isinstance(val, (list, tuple)):
        m = np.isin(a, val, invert=True).astype('int')
    else:
        m = np.where(a==val, 0, 1)
    dist, idx = nd.distance_transform_edt(m, return_distances=True,
                                          return_indices=True)
    alloc = a[tuple(idx)]
    a0 = np.where(dist<=buff_dist, alloc, a)  #0)
    return a0


def shrink_(a, val=1, mask_vals=0, buff_dist=1):
    """Expand/buffer a raster by a distance
    """

    if isinstance(val, (list, tuple)):
        m = np.isin(a, val, invert=False).astype('int')
    else:
        m = np.where(a==val, 1, 0)
    dist, idx = nd.distance_transform_edt(m, return_distances=True,
                                          return_indices=True)
    alloc = a[tuple(idx)]
    m = np.logical_and(dist>0, dist<=buff_dist)
    a0 = np.where(m, alloc, a)  #0)
    return a0


def regions_(a, cross=True):
    """currently testing regiongroup
    np.unique will return values in ascending order

    """

    if (a.ndim != 2) or (a.dtype.kind != 'i'):
        msg = "\nA 2D array of integers is required, you provided\n{}"
        print(msg.format(a))
        return a
    if cross:
        struct = np.array([[0,1,0], [1,1,1], [0,1,0]])
    else:
        struct = np.array([[1,1,1], [1,1,1], [1,1,1]])
    #
    u = np.unique(a)
    out = np.zeros_like(a, dtype=a.dtype)
    details = []
    is_first = True
    for i in u:
        z = np.where(a==i, 1, 0)
        s, n = nd.label(z, structure=struct)
        details.append([i, n])
        m = np.logical_and(out==0, s!=0)
        if is_first:
            out = np.where(m, s, out)
            is_first = False
            n_ = n
        else:
            out = np.where(m, s+n_, out)
            n_ += n
    details = np.array(details)
    details = np.c_[(details, np.cumsum(details[:,1]))]
    return out, details


def aggreg_(a, win=(3,3), agg_type='mode'):
    """Aggregate an array using a specified window size and an aggregation
    type

    Parameters
    ----------
    a : array
        2D array to perform the aggregation
    win : tuple/list
        the shape of the window to construct the blocks
    agg_type : string aggregation type
        max, mean, median, min, mode, range, sum, central

    """

    blocks = block(a, win=win)  # block: splits the array into windows; a sketch is given after this def
    out = []
    for bl in blocks:
        b_arr = []
        for b in bl:
            if agg_type == 'mode':
                uni = np.unique(b).tolist()
                cnts = [len(np.nonzero(b==u)[0]) for u in uni]
                idx = cnts.index(max(cnts))
                b_arr.append(uni[idx])
            elif agg_type == 'max':
                b_arr.append(np.nanmax(b))
            elif agg_type == 'mean':
                b_arr.append(np.nanmean(b))
            elif agg_type == 'median':
                b_arr.append(np.nanmedian(b))
            elif agg_type == 'min':
                b_arr.append(np.nanmin(b))
            elif agg_type == 'range':
                b_arr.append((np.nanmax(b) - np.nanmin(b)))
            elif agg_type == 'sum':
                b_arr.append(np.nansum(b))
            elif agg_type == 'central':
                b_arr.append(b[b.shape[0]//2, b.shape[1]//2])   # the central cell value
            else:
                print("\naggregation type not found {}".format(agg_type))
                b_arr.append(b[b.shape[0]//2, b.shape[1]//2])   # fall back to the central value
        out.append(b_arr)
    out = np.array(out)
    return out
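
The aggreg_ def relies on a block helper that isn't shown in this post.  A minimal sketch of one possible version is below, assuming the array dimensions divide evenly by the window size, followed by calls to the defs above on the demo raster.

def block(a, win=(3, 3)):
    """Split a 2D array into a (rows//r, cols//c) grid of (r, c) blocks."""
    r, c = win
    rows, cols = a.shape
    return a.reshape(rows//r, r, cols//c, c).swapaxes(1, 2)

aa = _make_arr_()
e = expand_(aa, val=1, buff_dist=2)               # expand zone 1 by 2 cells
s = shrink_(aa, val=1, buff_dist=1)               # shrink zone 1 by 1 cell
rg, details = regions_(aa, cross=True)            # regiongroup
agg = aggreg_(aa, win=(3, 3), agg_type='mode')    # mode aggregation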

Your standard fare in raster world

 

Fill in some holes … 'buffer' a line... allocate space to stuff in space... identify unique regions

Fancy names.  Nibble, Euclidean distance, Euclidean allocation, Regiongroup

 

---- (1) The task ----

Start with a raster or array.  Fancy labels in the top left, some random-ish color scheme with values noted in the middle.

Now, zero ( 0 ) we will say is nodata.  The other numbers represent some class value. 

 

Fill in the 'gaps'... aka, nodata, with the value of a cell with data based on the closest distance.  Euclidean, crow flies, for our purposes, but it need not be. 

Go! 

What did you get for cells A04, H01 and H02?  What about cells D07 and H08?

 

 

---- (2) The reveal ----

Let's see how we did

A04 - 2   can't be 1 because the diagonal to '1' is 1.414*cell width, so 2 it is

H01 - 2   could have been a 2 or a 3, because both are 3 cells away from cells with values

H02 - 3   no brainer

D07 - 3   could have been 3 or 4, but 3 wins

H08 - 3   3 is 2 cells away, and 1 and 5 are a greater distance away on an angle

 

---- (3) The distances ----

Pretty boring and obvious... but for completeness.

 

I don't do maps, but the dominant colors are 0 or root 2, as you can see from the spiffy ArcGIS Pro symbology tab.  So no big surprises.

 


---- (4) Allocation around linear features ----

 

Yes, that is possible too... sort of like a variable buffer or trade area, but currently it is just a simple Euclidean spatial allocation.

 

 

---- (5) The references ----

 

 Euclidean distance

Euclidean Allocation

A NumPy/SciPy plus arcpy solution is what I used:

 

import sys
import numpy as np
from scipy import ndimage as nd
from arcpy.geoprocessing import env
from arcpy.arcobjects import Point
from arcgisscripting import NumPyArrayToRaster, RasterToNumPyArray

env.overwriteOutput = True

def euclid_calc(a, mask_vals=0, dist=True, alloc=True):
    """Calculate the euclidean distance and/or allocation

    Parameters:
    a : array
        numpy float or integer array
    mask_vals : number, list or tuple
        If a single number is provided, a `mask` will be created using it.  A
        list or tuple of values can be used to provide multiple value masking.
    dist : boolean
        True, the distance of the closest non-masked value to the masked cell
    alloc : boolean
        True, the value of the closest non-masked value to the masked cell
    """

    if not dist:
        dist = None
    if not alloc:
        alloc = None
    m = np.isin(a, mask_vals).astype('int')
    dist, idx = nd.distance_transform_edt(m, return_distances=True,
                                          return_indices=True)
    alloc = a[tuple(idx)]
    return dist, alloc

# ----------------------------------------------------------------------
# .... final code section
if len(sys.argv) == 1:
    testing = True
    r_in = r"C:\GIS\A_Tools_scripts\Raster_tools\Data\poly.tif"
    a = RasterToNumPyArray(r_in)
    dist, alloc = euclid_calc(a, mask_vals=0, dist=True, alloc=True)
else:
    testing = False
    r_in = sys.argv[1]
    r_out = sys.argv[2]
    LLx = float(sys.argv[3])        # script-tool parameters arrive as text
    LLy = float(sys.argv[4])
    cell_sze = float(sys.argv[5])
    a = RasterToNumPyArray(r_in)
    dist, alloc = euclid_calc(a, mask_vals=0, dist=True, alloc=True)
    #
    # specify a, dist or alloc below... this could be a tool choice if needed
    #
    r0 = NumPyArrayToRaster(a, Point(LLx, LLy), cell_sze) 
    r0.save(r_out)

 

Of course, when run from within the amazing Spyder Python IDE, you can simply skip saving the output raster if you like, or try various other options from the scipy slate besides distance_transform_edt.
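
For example... a sketch, not part of the tool above: a city-block distance transform, or a morphological closing to plug small gaps.

from scipy import ndimage as nd
import numpy as np

m = np.isin(a, 0).astype('int')                            # mask of the nodata cells
dist_cb = nd.distance_transform_cdt(m, metric='taxicab')   # city-block distance
filled = nd.grey_closing(a, size=(3, 3))                   # closing to fill small gaps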

 

 

 

On to other "special analyst" tools soon.