reclassify using Numpy, out of memory

05-19-2016 03:41 PM
RebeccaStrauch__GISP
MVP Emeritus

EDIT: although I marked Dan's answer (re: making the array an integer array, which was the last piece I needed), I created a summary at the end https://community.esri.com/message/610325#comment-610325 that puts all the pieces that worked together.  I suggest starting there if you are running into this problem.

Bottom line: I'm trying to reclass an elevation raster (in meters) that includes "water", NoData, and the Denali foothills into 3-4 classes (based on feet, keeping the converted layer on disk, not in memory), then convert it to polygons.

For the study area I used for development (and one other), the raster Reclassify and arcpy.RasterToPolygon_conversion approach worked relatively fast (a few minutes at most).

I'm now running it on an area about 2x larger (~25 m cells, 6657 columns x 13035 rows), with a wider range of input values.  When using my script/tool or testing in the python window, I run out of memory after about 4 hours.

So I'm diving into NumPy and https://community.esri.com/people/Dan_Patterson 's blog https://community.esri.com/blogs/dan_patterson/2016/03/12/reclassify-raster-data-simply

After a clean reboot, I'm trying my test script in the python window, but I'm still getting an out-of-memory error message.  I'm not sure if I have an error in my script, or if it is just too large for the 32-bit Python to handle.  So I'm open to suggestions... maybe processing half at a time?  I have tried both a full raster with no NoData, and one that has NoData outside the study area boundary.

Test script I am using (parts borrowed from Dan's demo scripts)

import time
import os
import arcpy
from arcpy import env
from arcpy.sa import *
import numpy as np
from textwrap import dedent

arcpy.CheckOutExtension("spatial")

def arr_reclass(a, bins=[], new_bins=[], mask=False, mask_val=None):
  """a        - integer or floating point array to be reclassed
     bins     - sequential list/array of the lower limits of each class;
        include one value higher to cover the upper range
     new_bins - the new class values, one per bin interval
     mask     - whether the raster contains nodata values or values to
        be masked with mask_val
     array dimensions will be squeezed
  """
  a_rc = np.zeros_like(a)
  if len(bins) < 2:  # or (len(new_bins) < 2):
    print("Bins = {} new = {} won't work".format(bins, new_bins))
    return a
  if len(new_bins) < 2:
    new_bins = np.arange(1, len(bins) + 2)
  new_classes = zip(bins[:-1], bins[1:], new_bins)  # (lower, upper, new value) triples
  for rc in new_classes:
    q1 = (a >= rc[0])                 # at or above the lower limit
    q2 = (a < rc[1])                  # below the upper limit
    z = np.where(q1 & q2, rc[2], 0)   # cells in this class get the new value
    a_rc = a_rc + z
  return a_rc

theWorkspace = r'C:\___bear2016\_Skwentna2017\Skwentna_prep.gdb'
arcpy.env.workspace = theWorkspace
arcpy.env.overwriteOutput = True
wsPath, wsGDB = os.path.split(theWorkspace)
print("Project gdb (vector, i.e. for points, lines, polygons): {0}".format(theWorkspace))

inStudy = r'C:\___bear2016\_Skwentna2017\Skwentna_prep.gdb\Study16b'
stdyPath, stdyName = os.path.split(inStudy)
print("Study boundary: {0}".format(stdyName))
spatialref = arcpy.Describe(inStudy).spatialReference

rasterIn = r'C:\___bear2016\_Skwentna2017\Skwentna_raster.gdb\elevExtent'
#rasterIn = r'C:\___bear2016\_Skwentna2017\Skwentna_raster.gdb\elevClip'
inRas = arcpy.Raster(rasterIn)

lowerLeft = arcpy.Point(inRas.extent.XMin, inRas.extent.YMin)
cellSize = inRas.meanCellWidth

rasterGDB, rasterFile = os.path.split(rasterIn)
elevUnits = "meters"
elevCutoffFt = "6000"
flatElevBreak = "2000"
elevBreaks = flatElevBreak.split("; ")
elevBreaks.sort(key=float)  
flatElevBreak = ";".join(elevBreaks)
hydroBuff = r'C:\___bear2016\_Skwentna2017\Skwentna_prep.gdb\hydroBuff250'
finalTransZones = os.path.join(theWorkspace, "TransAreas")
outRiparian = os.path.join(theWorkspace, "TransRiparian")
pElevClass = os.path.join(theWorkspace, "elevReclassPoly")
outDEMtmp = os.path.join(theWorkspace, "elevReclass")

if elevUnits == "meters":   #  3.280839895013123
  toElevFeet = 3.28084
else:      # note, IF had to go from feet to meters, factor would be 0.3048
  toElevFeet = 1
elevFeet = (inRas * toElevFeet)
saveElevFeet = os.path.join(rasterGDB, "elevFt")
elevFeet.save(saveElevFeet)
elevMin = elevFeet.minimum  - 1
elevMax = elevFeet.maximum  + 1
noDataReclass = 99999

elevSplit = flatElevBreak.split(";")
lstCnt = 1
#reclassArg = []
bins = []
new_bins = []
for reclassValue in elevSplit:
  if lstCnt == 1:
    #reclassArg.append([int(elevMin), int(reclassValue), int(reclassValue)])
    bins.append(int(elevMin))
    bins.append(int(-1))
    bins.append(int(reclassValue))
    new_bins.append(int(-1))
    new_bins.append(int(reclassValue))
  else:
    #reclassArg.append([int(prevValue), int(reclassValue), int(reclassValue)])
    bins.append(int(reclassValue))    
    new_bins.append(int(0))    
    new_bins.append(int(reclassValue))    
  lstCnt += 1
  prevValue = reclassValue
#reclassArg.append([int(prevValue), int(elevCutoffFt), int(elevCutoffFt)])
#reclassArg.append([int(elevCutoffFt), int(elevMax), 9])

bins.append(int(elevCutoffFt))
bins.append(int(elevMax))
new_bins.append(int(elevCutoffFt))
new_bins.append(9)  
new_classes = zip(bins[:-1],bins[1:],new_bins)

print("   ...Old Ranges:\n   {0}".format(bins))
print("   ...New values:\n   {0}".format(new_bins))
print("   ...New classes:\n   {0}".format(new_classes))


arr = arcpy.RasterToNumPyArray(inRas, nodata_to_value=9999999)
mask = False
mask_val = None
print(arr)
arr_rc = arr_reclass(arr, bins=bins, new_bins=new_bins)

The error I am getting is on the last line; however, looking at the resources, there is plenty of RAM available, so I'm assuming it is just the ArcMap/Python limit.

Output from script (in ArcMap python window)

Project gdb (vector, i.e. for points, lines, polygons): C:\___bear2016\_Skwentna2017\Skwentna_prep.gdb

Study boundary: Study16b

...Old Ranges:

   [-107504, -1, 2000, 6000, 20036]

   ...New values:

   [-1, 2000, 6000, 9]

   ...New classes:

   [(-107504, -1, -1), (-1, 2000, 2000), (2000, 6000, 6000), (6000, 20036, 9)]

[[  205.16717529   205.25009155   205.29598999 ...,  1037.9387207   1037.86242676  1037.83959961]
 [  205.22860718   205.3127594    205.37818909 ...,  1037.93701172  1037.84533691  1037.76965332]
 [  205.29447937   205.38237      205.46986389 ...,  1037.91918945  1037.75964355  1037.76477051]
 ...,
 [ 1481.97961426  1492.3326416   1504.91308594 ...,   510.62741089   504.97958374   501.19992065]
 [ 1469.56079102  1486.19445801  1505.37487793 ...,   506.6071167    501.30090332   497.77194214]
 [ 1453.34509277  1471.94152832  1493.07250977 ...,   501.66564941   497.09884644   494.20864868]]

Runtime error

Traceback (most recent call last):

  File "<string>", line 115, in <module>

  File "<string>", line 30, in arr_reclass

MemoryError

>>>

If I need to split it due to size, some guidance for that would be good.  The one nice thing about this numpy process is that I get the error immediately instead of 4 hours in....

Thanks for any help!

ps - Dan... is this question seen by more than the 13 members/followers?  Nice "place", but I want to make sure others have a chance to find/reply.

Tagging a few more areas: Python, python snippets, Imagery and Remote Sensing

37 Replies
Luke_Pinner
MVP Regular Contributor

I tend to read rasters directly to arrays in memory as chunks, both because of the 2GB memmap limit and because it avoids the doubling up of file system IO (reading it in, saving it back out to .npy, reading it back in).  But I'm usually dealing with large stacks of timeseries data.  Horses for courses.
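
For what it's worth, a minimal sketch of that kind of chunked read with arcpy; the function name, strip size, and nodata value are placeholders (not Luke's actual code), and it just walks the raster in horizontal strips:

import arcpy

def iter_row_blocks(raster_path, block_rows=1024, nodata=-9999):
    """Yield (row_offset, array) strips read straight from the raster."""
    ras = arcpy.Raster(raster_path)
    cell = ras.meanCellHeight
    for row_off in range(0, ras.height, block_rows):
        nrows = min(block_rows, ras.height - row_off)
        # lower-left of this strip = YMax minus the rows covered down to its bottom edge
        ll = arcpy.Point(ras.extent.XMin,
                         ras.extent.YMax - (row_off + nrows) * cell)
        yield row_off, arcpy.RasterToNumPyArray(ras, ll, ras.width, nrows,
                                                nodata_to_value=nodata)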

DanPatterson_Retired
MVP Emeritus

Luke, are you doing data prep and processing outside of Arc*?  Or are you 'tiling' or equivalent when you mention chunks?  In this supposed world of big data, there seems to be little obvious interest in the data analysis side of the GIS field.  Most complaints are about the 1000-record limits for AGOL.

DanPatterson_Retired
MVP Emeritus

Just did some digging: Python prior to 2.5 had the 2GB limit (even on 64-bit systems).  NumPy does not; there were some pretty ridiculously large files being read and used.

see hpaulj's comment http://stackoverflow.com/questions/37365064/memory-mapping-slows-down-over-time-alternatives

Plus there are alternatives

http://stackoverflow.com/questions/26214696/creating-very-large-numpy-arrays-in-small-chunks-pytable...

h5py, pytables and Dask (haven't checked these all out yet; the first is best for simple single-content arrays, the second for tabular arrays, and Dask is from Continuum): Array — dask 0.9.0 documentation
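
For anyone curious, a rough sketch of the Dask pattern; it assumes the elevation array was already dumped to an HDF5 file (the file and dataset names here are made up), and it reuses arr_reclass, bins, and new_bins from the scripts above:

import h5py
import dask.array as da

f = h5py.File('elev.h5', 'r')                       # hypothetical HDF5 dump of the elevation array
d = da.from_array(f['elev'], chunks=(2048, 2048))   # lazy, chunked view; nothing read yet

# run the reclass block by block so only one chunk is in RAM at a time
rc = d.map_blocks(lambda blk: arr_reclass(blk, bins=bins, new_bins=new_bins))
result = rc.compute()   # or store chunk-wise (e.g. da.to_hdf5) instead of materialising it all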

Luke_Pinner
MVP Regular Contributor

Dask is pretty useful; I've played around with it, but I need to get up to speed as it's being used as the distributed processing engine in v2 of the Aust. Geoscience Data Cube API.

I'll look at memmap again; I didn't realise the limit was no longer there.  For most of what I do, reading to an array, then writing to a memmap file, then reading it in again just increases the amount of IO, but for larger timeseries it would be useful, especially for windowed operations (aka focal in ArcGIS terminology), since there is no need to handle tiling edge effects.
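
A minimal sketch of the strip-by-strip memmap pattern being discussed; the file names and strip size are arbitrary, and arr_reclass, bins, and new_bins come from the earlier scripts:

import numpy as np

src = np.load("elev.npy", mmap_mode="r")            # memory-mapped; reads nothing yet
dst = np.lib.format.open_memmap("elev_rc.npy", mode="w+",
                                dtype=np.int32, shape=src.shape)
step = 2000                                         # rows per strip
for r in range(0, src.shape[0], step):
    block = np.asarray(src[r:r + step])             # only this strip is pulled into RAM
    dst[r:r + step] = arr_reclass(block, bins=bins, new_bins=new_bins)
dst.flush()                                         # reclassed result now lives on disk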

DanPatterson_Retired
MVP Emeritus

Thanks for the references Luke, I will have a long look.  As for focal operations, I have only used stride tricks to get the windows I need.  At least you can emulate most of the SA functionality, albeit with limits for most people.  I will look into the GDC site in more detail.  I don't work with time series much, except for a recent foray which you may have seen: 6,000+ rasters and the follow-up, a bit of a patch job but interesting.
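
Not Dan's actual code, but a small illustration of the stride-tricks idea for a 3x3 focal mean (edges are trimmed rather than padded):

import numpy as np
from numpy.lib.stride_tricks import as_strided

def focal_mean_3x3(a):
    """3x3 moving mean using a strided view instead of explicit tiling."""
    r, c = a.shape
    sr, sc = a.strides
    # view of shape (r-2, c-2, 3, 3): the 3x3 neighbourhood of every interior cell
    win = as_strided(a, shape=(r - 2, c - 2, 3, 3), strides=(sr, sc, sr, sc))
    return win.mean(axis=(2, 3))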

RebeccaStrauch__GISP
MVP Emeritus

Luke/Dan - looked briefly at your latest comments, but will look at them in depth in the morning.  Thanks again for all the comments and suggestions.  I think I am close and will hopefully be able to fold this process back into the full script tomorrow.

RebeccaStrauch__GISP
MVP Emeritus

Luke, just an FYI, the reason I didn't use that is that determining the lower-left for several chunks may have caused issues... the array is a much cleaner split.

DanPatterson_Retired
MVP Emeritus

It may be split time... 

Now... you can create the array, save it to disk and load 'pieces', reclassify, save to disk, repeat, then reassemble.

I have no clue why you are running out of memory.  The arrays aren't that big, and the processing within numpy barely makes a glitch time-wise or memory-wise.

I would suggest you monitor your memory usage.  How much RAM do you have, anyway?

Anyway, the below is a verbose description with results so you can see the process conceptually.

Later...

replace your main section with the below, keeping the reclass function....

if __name__=="__main__":
    """run demo"""
    r = 8
    c = 10
    #  uncomment the 2 lines below to create the array only once, then recomment
    #  for subsequent testing
    #z = np.abs(100 * np.random.random_sample((r,c)) - 100)  # make rand value in 0-100 
    #np.save("z_8x10.npy",z) 
    z = np.load("z_8x10.npy")
    bins = [0, 10, 20, 30, 40, 60, 70, 80, 90, 100]
    new_bins = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    z_rc = arr_reclass(z, bins=bins, new_bins=new_bins)
    z_hist = np.histogram(z_rc,bins=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
    result = np.array(list(zip(z_hist[1],z_hist[0])))
    print("input array\n{}\nreclassed array\n{}".format(z, z_rc))
    print("histogram....\n(class, count)\n{}".format(result))
    # now reclass in slices and stack back
    z_rc0 = arr_reclass(z[:4,:], bins=bins, new_bins=new_bins)
    z_rc1 = arr_reclass(z[4:,:], bins=bins, new_bins=new_bins)
    z_stack = np.vstack((z_rc0,z_rc1))
    frmt="""
    Now split the array into some bits, do the reclass and reformulate
    If you want to use memmap (see np.load), it just reads the bits from disk
    If you want, you can reuse the names to clear memory...this was not done
    in this case.  you can save intermediate results to disk as well.
    ....
    first split and reclass...
    {}
    second...
    {}
    all equal? np.allclose(z_stack, z_rc) = {}
    """
    print(dedent(frmt).format(z_rc0,z_rc1, np.allclose(z_stack, z_rc)))

The results

input array
[[ 57.04  17.38  74.4  18.04  25.    99.25  18.09  9.79  45.35  67.94]
[ 28.73  74.36  32.13  60.14  0.79  14.74  60.81  20.31  53.26  89.08]
[ 26.97  70.29  96.59  43.5  97.16  96.33  18.    82.99  35.37  40.31]
[  0.12  52.13  26.93  19.59  48.93  20.31  4.3  24.43  29.33  20.46]
[ 85.84  12.67  94.1  43.57  76.68  22.15  93.13  66.64  61.31  49.3 ]
[ 55.35  25.44  23.02  62.98  12.44  83.66  42.53  92.32  48.41  24.  ]
[ 70.27  45.64  3.    42.58  73.27  66.55  1.68  67.29  48.39  2.47]
[ 96.38  4.19  78.12  82.1  13.83  73.13  52.77  28.02  57.7    5.44]]
reclassed array
[[ 5.  2.  7.  2.  3.  9.  2.  1.  5.  6.]
[ 3.  7.  4.  6.  1.  2.  6.  3.  5.  8.]
[ 3.  7.  9.  5.  9.  9.  2.  8.  4.  5.]
[ 1.  5.  3.  2.  5.  3.  1.  3.  3.  3.]
[ 8.  2.  9.  5.  7.  3.  9.  6.  6.  5.]
[ 5.  3.  3.  6.  2.  8.  5.  9.  5.  3.]
[ 7.  5.  1.  5.  7.  6.  1.  6.  5.  1.]
[ 9.  1.  7.  8.  2.  7.  5.  3.  5.  1.]]
histogram....
(class, count)
[[ 1  9]
[ 2  9]
[ 3 14]
[ 4  2]
[ 5 17]
[ 6  8]
[ 7  8]
[ 8  5]
[ 9  8]]


Now split the array into some bits, do the reclass and reformulate
If you want to use memmap (see np.load), it just reads the bits from disk
If you want, you can reuse the names to clear memory...this was not done
in this case.  you can save intermediate results to disk as well.
....
first split and reclass...
[[ 5.  2.  7.  2.  3.  9.  2.  1.  5.  6.]
[ 3.  7.  4.  6.  1.  2.  6.  3.  5.  8.]
[ 3.  7.  9.  5.  9.  9.  2.  8.  4.  5.]
[ 1.  5.  3.  2.  5.  3.  1.  3.  3.  3.]]
second...
[[ 8.  2.  9.  5.  7.  3.  9.  6.  6.  5.]
[ 5.  3.  3.  6.  2.  8.  5.  9.  5.  3.]
[ 7.  5.  1.  5.  7.  6.  1.  6.  5.  1.]
[ 9.  1.  7.  8.  2.  7.  5.  3.  5.  1.]]
all equal? np.allclose(z_stack, z_rc) = True
RebeccaStrauch__GISP
MVP Emeritus

Dan, I'll take a look at this tomorrow.  I did test your other script, i.e. creating the random array, on my home Win10 machine.  Again, 10.3.1 gave the memory error.  I then ran it in Pro (using the file written before the other crashed) and it worked, so that does point to the version of Python and/or numpy being the issue.  However, I do need to make this work in ArcMap, not Pro, so the split option seems like a better solution.

Luke, I think the speed of numpy or similar is the way for me to go, especially if I can quickly split the array and remap.  Where the arcpy remap took 4+ hours of running before it ran into issues with memory, the numpy version at least let me know in a matter of seconds (under a minute).  I had looked at chunking it up with the options available in RasterToNumPyArray, and that would probably work if I figure out the logic to find the lower-left corners of each split... and that might still be an option.  I was playing around with the SplitRaster command, but my first attempt was a disaster since I messed up the attributes.  Anyway, I appreciate the effort, and in the long run I'll probably use bits of all the suggestions.
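
In case it helps with the lower-left logic, here is a rough sketch of a two-strip split that reuses rasterIn/elevFeet, arr_reclass, bins, new_bins, and outDEMtmp from the scripts above; it reads the feet-converted elevFeet raster since the bins are in feet (swap in inRas if that is what you intend), and it is untested against this data, so treat it as a starting point:

import arcpy
import numpy as np

ras = elevFeet                        # the feet raster from the earlier script
cell = ras.meanCellHeight
xmin, ymin, ymax = ras.extent.XMin, ras.extent.YMin, ras.extent.YMax
half = ras.height // 2                # two horizontal strips; more would work the same way

# top strip (rows 0 .. half-1): its lower-left Y sits half*cell below YMax
top = arcpy.RasterToNumPyArray(ras, arcpy.Point(xmin, ymax - half * cell),
                               ras.width, half, nodata_to_value=9999999)
# bottom strip (the remaining rows): its lower-left is the raster's own corner
bot = arcpy.RasterToNumPyArray(ras, arcpy.Point(xmin, ymin),
                               ras.width, ras.height - half,
                               nodata_to_value=9999999)

top_rc = arr_reclass(top, bins=bins, new_bins=new_bins)
bot_rc = arr_reclass(bot, bins=bins, new_bins=new_bins)

# stack back in row order (top first) and write out using the full extent's corner
outRas = arcpy.NumPyArrayToRaster(np.vstack((top_rc, bot_rc)),
                                  arcpy.Point(xmin, ymin), cell, cell)
outRas.save(outDEMtmp)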

DanPatterson_Retired
MVP Emeritus

Yes, it is the Python version... split works too.
