
Raster processing in arcpy (performance issues)

10-06-2011 07:05 AM
SebastianSantibanez
Emerging Contributor
Hi,
I have written a script that creates a cost distance raster for each feature (point) in each feature class in a GDB, iterating first over the feature classes (a 'for' loop over the FC list) and then over the rows (using a cursor). The script works fine, but it takes an overwhelmingly long time to complete (40 hours for ~15,000 features).
- Is there any way to speed up the script?
- Somebody suggested to me that the cursor might be the cause of the poor performance. Is this possible? If so, how can I fix it?
- Another approach I am considering is using a different function (not arcpy) to calculate the distances. Does anybody know another Python module I could use instead?
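
For reference, the structure of my script is roughly this (simplified, with placeholder names and paths):

import arcpy
from arcpy.sa import CostDistance

arcpy.CheckOutExtension("Spatial")
arcpy.env.workspace = r"C:\data\analysis.gdb"  #placeholder
arcpy.env.overwriteOutput = True
costSurface = r"C:\data\cost_grd"  #placeholder cost raster

for fc in arcpy.ListFeatureClasses("", "Point"):  #outer loop: feature classes
    for row in arcpy.SearchCursor(fc):  #inner loop: one point at a time
        oid = row.getValue("OBJECTID")
        #Isolate the single point, then build one cost distance raster for it
        arcpy.Select_analysis(fc, "in_memory\\pt", '"OBJECTID" = %d' % oid)
        outRas = CostDistance("in_memory\\pt", costSurface)
        outRas.save("cd_%s_%d" % (fc, oid))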

Thanks so much.
5 Replies
ChrisSnyder
Honored Contributor
I am in the same boat. I wrote a large Python-based process that runs a fancy Monte Carlo / bootstrap habitat simulation. It builds about 50,000 individual cost distance grids (actually cost path grids, but same difference) and takes a really long time to run. Some things I found, used, plan to use, and want to use in a future v10 rewrite:


  1. The "max distance" parameter in the CostDistance tool doesn't work.

  2. I added an extra step to buffer the point and then set gp.extent to the buffer extent, so as to limit the size of the cost distance surface. That way I wasn't building cost distance values for areas I didn't need (see the first sketch after this list).

  3. GRID format (instead of FGDB) might be faster (even in v10). I wrote my script for v9.3, and at that time GRID format was faster. That may have changed in v10.0, but I haven't tested it.

  4. In v10.0 it appears you can now run concurrent Spatial Analyst processes (for example, run 4 CostDistance tools in separate python.exe instances at the same time) using the os.spawnv, subprocess, multiprocessing, etc. modules (see the second sketch after this list). In v9.3 you could only run one Spatial Analyst process at a time. This would be an immense performance boost, but I haven't rewritten my script in arcpy yet to take advantage of this improvement (I plan to in a few months, though).

  5. I didn't do it for the CostPath function, but using an in-memory Python structure like a dictionary or a numpy array you can write your own custom CostPath (or whatever) function. I have done a bit of reverse engineering of ESRI functions, but at a great investment in time (this is why my agency pays ESRI $300k+ a year, right???). I sure do wish ESRI would provide an "out of the box" in-memory grid structure (make raster data compatible with their in_memory workspace) so that we wouldn't have to write our own functions... Or at least maybe they could publish some of their Spatial Analyst source code (yeah right!).
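
Here is roughly what item 2 looks like in 9.3-style gp code (pointFC, bufferDist, costSurface, and outGrd are placeholder names):

#Shrink the analysis window to a buffer around the single point so
#CostDistance doesn't compute values far beyond the distance we care about
gp.Buffer_analysis(pointFC, "in_memory\\pt_buf", bufferDist)
gp.extent = "in_memory\\pt_buf"  #the extent environment accepts a dataset
gp.CostDistance_sa(pointFC, costSurface, outGrd)
gp.extent = ""  #reset before the next point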
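
And here is one way to fan out the work for item 4 with the subprocess module: launch several python.exe workers, each handling a slice of the points (worker.py and its argument convention are hypothetical):

import subprocess, sys

pointIds = range(15000)  #placeholder: however you enumerate your points
numWorkers = 4
chunkSize = len(pointIds) // numWorkers + 1
procs = []
for i in range(numWorkers):
    chunk = pointIds[i * chunkSize:(i + 1) * chunkSize]
    if chunk:
        #worker.py (hypothetical) runs CostDistance for each id in its range
        cmd = [sys.executable, "worker.py", str(chunk[0]), str(chunk[-1])]
        procs.append(subprocess.Popen(cmd))
for p in procs:
    p.wait()  #block until every worker finishes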


That said, there are probably some ways to speed up the script. One easy thing: just use a larger cell size. The cursor probably isn't the real culprit... I assume you are using it to extract the individual point features, correct? As for non-ESRI cost distance functions, I am sure there are many. For example: http://grass.fbk.eu/gdp/html_grass64/r.cost.html. Just waiting for someone to write it in pure Python... If you do, POST IT!!!
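
In that spirit, a bare-bones pure-Python version might look like this: Dijkstra's algorithm over a numpy cost grid, averaging the two cell costs per move the way CostDistance does (no nodata handling, no cell size scaling, so purely a starting point):

import heapq
import numpy as np

def costDistance(costGrid, sources):
    "Accumulated-cost surface over a 2D numpy cost grid from a list of (row, col) source cells"
    nRows, nCols = costGrid.shape
    accCost = np.empty(costGrid.shape)
    accCost.fill(np.inf)
    heap = []
    for r, c in sources:
        accCost[r, c] = 0.0
        heapq.heappush(heap, (0.0, r, c))
    #8-connected moves; diagonals are sqrt(2) cells long
    moves = [(-1, 0, 1.0), (1, 0, 1.0), (0, -1, 1.0), (0, 1, 1.0),
             (-1, -1, 2 ** 0.5), (-1, 1, 2 ** 0.5),
             (1, -1, 2 ** 0.5), (1, 1, 2 ** 0.5)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > accCost[r, c]:
            continue  #stale heap entry; a cheaper path was already found
        for dR, dC, length in moves:
            nR, nC = r + dR, c + dC
            if 0 <= nR < nRows and 0 <= nC < nCols:
                #cost of a move = average of the two cell costs x link length
                newCost = d + length * (costGrid[r, c] + costGrid[nR, nC]) / 2.0
                if newCost < accCost[nR, nC]:
                    accCost[nR, nC] = newCost
                    heapq.heappush(heap, (newCost, nR, nC))
    return accCost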
ChrisSnyder
Honored Contributor
Oh, one other thing I found: in v9.3, using the gp.snapraster environment slowed down many of the Spatial Analyst functions (some by more than just a few seconds). In my case, 2 sec x 50,000 runs = almost 28 hours. My solution was to write my own "snap raster" function that alters the gp.extent setting instead. Using my function (as opposed to gp.snapraster), I saw no snapraster-related slowdown...

#gp is the 9.3-style geoprocessor object, created elsewhere in the script,
#for example: import arcgisscripting; gp = arcgisscripting.create(9.3)

def snapCoord(coordInput, coordSnap, cellSize):
    "Snaps a single coordinate to the nearest cell boundary, measured from the snap raster's matching edge"
    cells, remainder = divmod(abs(coordInput - coordSnap), cellSize)
    if remainder / cellSize >= 0.5:
        cells = cells + 1  #closer to the far boundary, so round up one cell
    if coordSnap < coordInput:
        return coordSnap + cells * cellSize
    else:
        return coordSnap - cells * cellSize

def snapExtentToRaster(inputExtent, snapRaster):
    "Returns an 'xMin yMin xMax yMax' extent snapped to the cell alignment of a raster"
    xMinInput, yMinInput, xMaxInput, yMaxInput = [float(i) for i in inputExtent.split(" ")[:4]]
    dscSnapRaster = gp.describe(snapRaster)
    cellSize = float(dscSnapRaster.meancellheight)
    #Snap each edge of the input extent to the corresponding edge of the snap raster
    xMinOutput = snapCoord(xMinInput, float(dscSnapRaster.extent.xmin), cellSize)
    yMinOutput = snapCoord(yMinInput, float(dscSnapRaster.extent.ymin), cellSize)
    xMaxOutput = snapCoord(xMaxInput, float(dscSnapRaster.extent.xmax), cellSize)
    yMaxOutput = snapCoord(yMaxInput, float(dscSnapRaster.extent.ymax), cellSize)
    #Returns the entire modified extent
    return str(xMinOutput) + " " + str(yMinOutput) + " " + str(xMaxOutput) + " " + str(yMaxOutput)

#Set gp.extent to the buffer feature for this nestId, then snap that extent
#with snapExtentToRaster() instead of using gp.snapraster (which slows things down)
gp.extent = ""
tempExtentML = "in_memory\\temp_extent"
gp.Select_analysis(nestIdsBufferFC, tempExtentML, "NEST_ID = " + str(nestId))
gp.extent = tempExtentML
gp.extent = snapExtentToRaster(gp.extent, habGrd)
KimOllivier
Honored Contributor
Things have changed a lot at 10.x for Spatial Analyst. Before you look for fancy alternatives, just refactor it for the new Python Grid Algebra.

The key improvement that stands out here is that the grid is held in memory, so you could iterate over the same grid without reloading it each time, and the output is also held in memory until you want to save it.

But if that is still too slow, could numpy be programmed to run the same processes?

Cup of Coffee Rule
If any process takes longer than a cup of coffee, interrupt it and fix it or find a better way.

I would never accept a model that takes 40 hours. What if you have to re-run it?
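
For example, something along these lines (a rough sketch; the paths and names are placeholders):

import arcpy
from arcpy.sa import Raster, CostDistance

arcpy.CheckOutExtension("Spatial")
#Load the cost surface once as a Raster object and reuse it for every point
costSurface = Raster(r"C:\data\cost_grd")  #placeholder path
for i, pointFC in enumerate(pointFCList):  #placeholder list of single-point feature classes
    #The result is a temporary raster; nothing is committed until you call save()
    outRas = CostDistance(pointFC, costSurface)
    outRas.save(r"C:\data\out\cd_%d" % i)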
SebastianSantibanez
Emerging Contributor
Chris,
Thank you for your advice. I will definitely take a look at the GRASS implementation.

Kim,
I am using arcpy (ArcMap 10) and the CostDistance tool takes 3 to 5 seconds per point. Can you be more specific about your suggestion to review the new Python grid algebra? Thank you.
NilsBabel
Frequent Contributor
Quoting Kim Ollivier: "The key improvement that stands out here is that the grid is held in memory ... and the output is also held in memory until you want to save it."

Heads up: your raster objects created in arcpy (v10) are not held in memory. They are temporary rasters that are written to disk. They disappear when your script completes, but they are written to disk (either in your current workspace or your scratch workspace). Check it out next time you're running a script: open a file browser and refresh it while the script is working with raster objects, and you'll see some strangely named rasters appear in your folder and then disappear. Apparently, in 10.1 you will be able to create rasters in in_memory. That should be nice.
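
A quick way to see it (paths are placeholders):

import arcpy
from arcpy import env
from arcpy.sa import Raster

arcpy.CheckOutExtension("Spatial")
env.scratchWorkspace = r"C:\temp"  #watch this folder while the script runs
tmpRas = Raster(r"C:\data\dem") + 100  #"temporary" raster, but it is written under the scratch workspace
print(tmpRas.isTemporary)  #True; it gets deleted when the script exits
tmpRas.save(r"C:\data\dem_plus100")  #calling save() is what makes it permanent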

Good luck,

Nils