ArcPy randomly crashes when running iterated cost distance tool (Spatial Analyst)

09-24-2012 09:58 AM
SebastianSantibanez
New Contributor
Hi all,
I'm experiencing a random error when I run a record-iterated cost distance script in arcpy. The code used to work fine in ArcGIS 10.0, but in 10.1 it gives me nothing but headaches. I want to point out that the code fails randomly at different stages of the iteration; however, it has never made it past the 200th record.

Traceback (most recent call last):
File "C:\Users\urban4m_02\Documents\MSAu4m\py_codes\cost_dist_norm.py", line 67, in <module>
outCostDist = CostDistance(row[2], cost, maxDist*factor)
File "C:\Program Files (x86)\ArcGIS\Desktop10.1\arcpy\arcpy\sa\Functions.py", line 640, in CostDistance
out_backlink_raster)
File "C:\Program Files (x86)\ArcGIS\Desktop10.1\arcpy\arcpy\sa\Utils.py", line 47, in swapper
result = wrapper(*args, **kwargs)
File "C:\Program Files (x86)\ArcGIS\Desktop10.1\arcpy\arcpy\sa\Functions.py", line 634, in wrapper
out_backlink_raster)
File "C:\Program Files (x86)\ArcGIS\Desktop10.1\arcpy\arcpy\geoprocessing\_base.py", line 484, in <lambda>
return lambda *args: val(*gp_fixargs(args, True))
ExecuteError: ERROR 999999: Error executing function.
Workspace or data source is read only.
Workspace or data source is read only.
ERROR 010029: Unable to create the raster C:\Users\urban4m_02\Documents\MSAu4m\scratch.gdb\CostDis_f0F01. Cost Distance mapping Failed
ERROR 010067: Error in executing grid expression.
Failed to execute (CostDistance).

I'm running the iteration using
with arcpy.da.SearchCursor(inputFC,["OID@","SHAPE@XY","SHAPE@","ICOUNT"]) as cursor:
    for row in cursor:
        outCost = CostDistance(row[2], costSurface, maxDist)
        ...


and I have tried fixing it by adding
cursor.reset()
arcpy.CompressFileGeodatabaseData_management
gc.collect()
but no luck so far.

Has anybody else experienced similar issues when running an iterated spatial analyst tool in arcpy?
8 Replies
MathewCoyle
Frequent Contributor
"Workspace or data source is read only." seems to indicate some rogue locking going on. I'd try creating a function for your SA calls and calling that from your loop. I've found it has helped me with the the dreaded 9's error.

I'm assuming you haven't changed computers when you switched from 10.0 to 10.1, and that you haven't had any changes in local or domain permissions. The randomness of your errors seems to rule these out anyway.

How is your memory usage while running this? ArcGIS Desktop is a 32-bit process, so if it starts climbing over 2 GB you are going to have problems.

As an aside, I only run intensive tasks outside of cursors: first build a dictionary or list of the required parameters from the cursor, then base the loop on that.
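A rough sketch, reusing your field list (the output path is hypothetical):

# Pull everything you need out of the cursor first, then loop with no
# cursor open while the geoprocessing tools run
params = [(row[0], row[2]) for row in
          arcpy.da.SearchCursor(inputFC, ["OID@", "SHAPE@XY", "SHAPE@", "ICOUNT"])]
for oid, shape in params:
    outCost = CostDistance(shape, costSurface, maxDist)
    outCost.save(r"C:\temp\scratch.gdb\costdist_%s" % oid)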
ChrisSnyder
Regular Contributor III
Did you accidentally run the Compress tool on that FGDB (thereby making it read-only)?

FYI: Many raster functions (including CostDistance) still run significantly faster in GRID format vs. FGDB raster format.
SebastianSantibanez
New Contributor
Did you accidentally run the Compress tool on that FGDB (thereby making it read-only)?

FYI: Many raster functions (including CostDistance) still run significantly faster in GRID format vs. FGDB raster format.


Thank you for your input, Chris.
No Compress tool was ever applied to the FGDBs. Besides, I have restarted the process from brand-new FGDBs and keep encountering the same problem.
ChrisSnyder
Regular Contributor III
and I have tried fixing it by adding
cursor.reset()
arcpy.CompressFileGeodatabaseData_management
gc.collect()
but no luck so far.


Did you mean to use Compact instead? Compress makes the FGDB theoretically smaller (although not for raster datasets), and read-only. Run the Uncompress tool to make it writable again.

Maybe not totally related, but I am rewriting some v9.3 code to v10.1 that runs CostDistance in a loop many thousands of times... probably similar to what you are doing. I'd enjoy swapping some performance best practices if you are willing... I like how you get the point geometry from the cursor and use it as direct input to the CostDistance tool. Hope you don't mind if I use it...

Some tricks I have found:
1. Buffer your input points layer, so that you can set the analysis extent to the point's buffer polygon to limit the processing time/extent of each cost distance surface. For example:
arcpy.env.extent = arcpy.da.SearchCursor(nestSitesBufferFC, "SHAPE@", "POINT_ID = " + str(pointId)).next()[0].extent

2. Use GRID format! CostDistance runs way faster with GRID format inputs and outputs.

3. Limit the number of grids in a single folder (this might apply to FGDB rasters as well?). I wrote some code that makes a new folder every 100th loop, which is where I noticed performance taking a real nose dive; a rough sketch is below. I suspect the tool has to figure out a scratch name every time, and the more rasters you have in a workspace, the longer it takes to come up with a unique scratch name, since it has to list the contents every time. See: http://forums.arcgis.com/threads/67762-Spatial-Analyst-scratch-raster-naming-system-causing-slowness...
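The folder-rotation trick looks something like this (outputRoot and pointIdList are just placeholders for whatever you are looping over):

import os
import arcpy

for i, pointId in enumerate(pointIdList):
    if i % 100 == 0:
        # Start a fresh workspace every 100 outputs, so the scratch-name
        # search never has to list more than ~100 rasters
        workspace = os.path.join(outputRoot, "cd_batch_%04d" % (i // 100))
        os.mkdir(workspace)
        arcpy.env.workspace = workspace
    # ...run CostDistance and save the output into the current workspace...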
SebastianSantibanez
New Contributor
Thank you for your comments Chris!

I have compressed and uncompressed, compacted, and started from scratch in new databases, so I'm positive that the error is not there.

In my code I use exactly the same practices you mentioned, with the exception that I normally define the env extent inside the loop, starting from "SHAPE@XY" +/- the buffer distance (this probably saves me a couple of milliseconds since I don't have to create a buffered feature). Actually, if I'm not mistaken, you gave me that idea in a previous post.
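In other words, inside the loop I do something like this (bufferDist is whatever maximum distance applies):

x, y = row[1]  # the "SHAPE@XY" token
arcpy.env.extent = arcpy.Extent(x - bufferDist, y - bufferDist,
                                x + bufferDist, y + bufferDist)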

Anyway, I have been talking with a developer about this error, and we agree that the debugging is going to be tedious and probably will not lead us to a permanent solution. So we have decided to port this part of the development to r.cost (the GRASS module). I tested r.cost in GRASS and I would say that it runs at least 5x faster than ArcGIS.

As a side note, maybe you want to consider processing your cost distances in blocks. Select a subset of points far enough from each other that the cost distances will not overlap, and process them all together. I haven't tested this approach, but in my mind it makes perfect sense... obviously the most complicated part will be creating those subsets (probably a heuristic search algorithm would help; a naive sketch follows).
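For what it's worth, a naive greedy version of that subset selection might look like this (untested; points is a hypothetical list of (id, x, y) tuples):

import math

def pick_nonoverlapping(points, min_sep):
    # Keep a point only if it is at least min_sep map units from every
    # point already kept (min_sep ~ 2x the maximum cost distance)
    kept = []
    for pid, x, y in points:
        if all(math.hypot(x - kx, y - ky) >= min_sep for _, kx, ky in kept):
            kept.append((pid, x, y))
    return kept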

I hope it helps.
ChrisSnyder
Regular Contributor III
Hi Sebastian,

That is a good idea to use the buffer geometry object instead. In my case I need those point buffer extents in several later scripts, so I figured it would be faster just to build them once and reference those.

The block idea is a good one too. I hadn't thought of that... Maybe in v2.0... A simple way might be to cut my study area into quadrants (4 rectangles) and simply ID the point buffers that fall entirely within a quadrant polygon. Then select four at a time (a random selection of QUAD_IDs 1, 2, 3, 4) and process each batch at once. For my purposes, I note that limiting the extent makes a run about 10 times as fast as using the entire study area. So it might take some fancier algorithm design...

I really like your GRASS workaround... Do you have any example scripts that make use of that? I'd love to see them. I have noted in the past that the ESRI cost-&lt;whatever&gt; algorithms are only 8-direction, which leads to some not-so-natural-looking outputs: http://forums.arcgis.com/threads/44464-Problems-with-cost-path-analysis-not-shortest-distance?p=1517.... It sure would be nice if they could include a knight's move option in the Cost* functions. Hint, hint to any ESRI people that might see this...

In my own experience I have noted some funky behavior with random crashes in SA tools... In the past I was able to "fix" the problem by using only raster inputs... for example, converting the input point feature in CostDistance or Watershed to a raster first, and then using the raster version of the point as input (yes, even if the dang point was smack in the middle of the pixel).

An interesting note: I have this new fancy computer that absolutely screams (Xeon 2687 with SSDs in RAID0). I am really excited to have this machine (and its 8, but effectively 16, cores) so that I can finally leverage some subprocess scripting with SA functions (prior to v10 you couldn't run concurrent SA tools). I noted that my subprocesses (15 concurrent ones!) occasionally "collided" and would randomly crash and fail... not necessarily CostDistance... but any SA tool. Big letdown after I thought I'd finally be able to do all this fancy concurrent SA processing. I guessed that all the separate SA processes must be *** &^%$ing STILL*** trying to write some scratch temp file somewhere (occasionally all at the same time). So after some real hair pulling, I figured out that creating separate dummy TEMP directories BEFORE launching each subprocess completely fixed the problem - *** boy was I happy to get that thing working!***

So my point: could it be that you have concurrent SA processes that are all writing to the same TEMP folder and occasionally "colliding"?
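For reference, the raster-input workaround mentioned above looks something like this (pointFL and cellSize are just placeholders):

# Convert the source point to a raster first, then feed the raster
# version to CostDistance
arcpy.env.snapRaster = costSurface  # line the cells up with the cost grid
arcpy.PointToRaster_conversion(pointFL, "OBJECTID", "pnt_ras", "", "", cellSize)
outCostDist = CostDistance("pnt_ras", costSurface, maxDist)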

If it helps, here's the latest incarnation of my "master" script, which calls the script that builds the CostDistance grids as a subprocess:

# Description
# -----------
# This script is the master script that builds all the cost distance grids
# Written for Python version: 2.7.2 and above (PythonWin)
# Written for ArcGIS version: 10.1 SP0
# Fuzzy logic score scale of 0-100, 50 being neutral
# Author: Chris Snyder, WA Department of Natural Resources, chris.snyder(at)wadnr.gov
# Date created: 4/1/2010
# Last updated: 9/27/2012, csny490
try:
    #Import system modules
    import sys, os, time, traceback, subprocess

    #Defines some functions
    def launchProcess(jobId):
        global jobDict
        tmpDirName = r"C:\Temp\gp_tmp_" + time.strftime('%Y%m%d%H%M%S')
        os.mkdir(tmpDirName)
        os.environ["TEMP"] = tmpDirName
        os.environ["TMP"] = tmpDirName
        os.environ['TEMPDIR'] = tmpDirName
        inputVar1 = jobDict[jobId][2][0]
        inputVar2 = jobDict[jobId][2][1]
        inputVar3 = jobDict[jobId][2][2]
        jobDict[jobId][4] = subprocess.Popen([jobDict[jobId][0], jobDict[jobId][1], str(inputVar1), str(inputVar2), str(inputVar3)], shell=False)
        jobDict[jobId][3] = "IN_PROGRESS" #Indicate the job is 'IN_PROGRESS'
        time.sleep(2) #Give the subprocesses (especially arcpy enabled subprocesses) some time to catch up
    def showPyMessage():
        try:
            print >> open(logFile, 'a'), str(time.ctime()) + " - " + str(message)
            print str(time.ctime()) + " - " + str(message)
        except:
            pass
        
    #Specifies the root directory variable, defines the logFile variable, and does some minor error checking...
    root = r"C:\csny490\nso_model"
    if os.path.exists(root)== False:
        print "ERROR: Specified root directory " + root + " does not exist... Exiting script!";time.sleep(3);sys.exit()
    scriptName = sys.argv[0].split("\\")[-1].split(".")[0] #Gets the name of the script without the extension  
    dateTimeStamp = time.strftime('%Y%m%d%H%M%S') #in the format YYYYMMDDHHMMSS
    logFile = root + "\\" + scriptName + "_" + dateTimeStamp + ".log" #Creates the logFile variable
    if os.path.exists(logFile)== True:
        os.remove(logFile)
        message = "Deleting existing log file " + logFile + "... Recreating " + logFile; showPyMessage()

    #Process: Make sure the "cost_grids" dir exists...
    if os.path.exists(root + "\\cost_grids") == False:
        os.mkdir(root + "\\cost_grids")

    #Process: Defines some variables
    scenarioDict = {"NA": 'No Action', "LP": 'Landscape'}
    decadeList = [0,1,2,3,4,5,6,7,8,9]
    pathToScenarioDecadeGrids = root + "\\decade_scenario_grids"    

    #Determine how many processes to run concurrently
    numberOfProcessorsToUse = 15 #The number of processors you (the user) wants to use
    if numberOfProcessorsToUse > int(os.environ.get("NUMBER_OF_PROCESSORS")):
        numberOfProcessorsToUse = int(os.environ.get("NUMBER_OF_PROCESSORS"))

    #Process: Define the python.exe path and slaveScript path
    childProcExePath = os.path.join(sys.prefix, "python.exe")
    childScriptPath = r"\\snarf\am\div_lm\ds\gis\projects\oesf_eis_draft_second\nso_models\scripts\nso_model_scripts_v20120912\STEP5B_path_distance_slave_v101_20120912.py"

    #Process: Loop through the decades and scenarios
    jobDict = {}
    for decade in decadeList:
        for scenario in scenarioDict:
            jobDict[(scenario, decade)] = [childProcExePath, childScriptPath, [root, scenario, decade], "NOT_STARTED", None]
      
    #Process: Get those jobs going!
    while len([i for i in jobDict if jobDict[i][3] != 'NOT_STARTED']) < numberOfProcessorsToUse:
        launchProcess([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED'][0]) #Feed the appropriate jobId to the launchProcess() function
    while len([i for i in jobDict if jobDict[i][3] in ("SUCCEEDED","FAILED")]) < len(jobDict):
        time.sleep(10)
        for jobId in [i for i in jobDict if jobDict[i][3] == 'IN_PROGRESS' and jobDict[i][4].poll() != None]: #if a subprocess is listed as 'IN_PROGRESS' and poll() returns not None (i.e. it finished, but success or failure is unknown)
            if jobDict[jobId][4].returncode == 0: #return code of 0 indicates success (no sys.exit(1) command encountered in the child process)
                jobDict[jobId][3] = "SUCCEEDED"
                message = "SUCCESS: " + str(jobId) + " completed successfully..."; showPyMessage()
            if jobDict[jobId][4].returncode > 0: #return code of 1 (or another integer value) indicates failure (a sys.exit(1) command was encountered in the child process)
                jobDict[jobId][3] = "FAILED"
                message = "ERROR: " + str(jobId) + " failed..."; showPyMessage()
            if len([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED']) > 0: #if there are still jobs = 'NOT_STARTED', launch the next one in line
                launchProcess([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED'][0])

    #Process: Final check for failures - if any, exit(1) 
    if len([i for i in jobDict if jobDict[i][3] == 'FAILED']) > 0:
        message = "ERROR: These jobs were an epic fail:"; showPyMessage()
        for jobId in [i for i in jobDict if jobDict[i][3] == 'FAILED']:
            message = str(jobId) + " failed..."; showPyMessage()
        sys.exit(1)

    message = "ALL DONE!"; showPyMessage()
    
except:
    message = "\n*** PYTHON ERRORS *** "; showPyMessage()
    message = "Python Traceback Info: " + traceback.format_tb(sys.exc_info()[2])[0]; showPyMessage()
    message = "Python Error Info: " +  str(sys.exc_type)+ ": " + str(sys.exc_value) + "\n"; showPyMessage()
    sys.exit(1)
SebastianSantibanez
New Contributor
Chris
I haven't started developing the prototype in grass yet, but when I have some news I will share them.
As far as concurrent SA processes go: the prototype I built ran a naive algorithm, and after so many unexpected crashes I didn't feel like optimizing it. I will try your approach of starting with a raster and see if it makes any difference; maybe after that I will run concurrent processes.

Thank you again for your help.

PS: The cost surface tool runs a modified Dijkstra's algorithm in the background, right? If so, it might be more efficient to write your own code.
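If you ever go down that road, the core of it is just Dijkstra over the 8-connected grid. A rough sketch (untested, plain numpy + heapq; cost is a 2-D array of per-cell costs):

import heapq
import numpy as np

def cost_distance(cost, src_r, src_c):
    # Accumulated cost to every cell from the source cell, where a move
    # costs the mean of the two cells' costs times the move length
    # (roughly how Esri documents it)
    nrows, ncols = cost.shape
    dist = np.full((nrows, ncols), np.inf)
    dist[src_r, src_c] = 0.0
    heap = [(0.0, src_r, src_c)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r, c]:
            continue  # stale heap entry
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < nrows and 0 <= nc < ncols:
                    length = 1.4142135 if dr and dc else 1.0
                    nd = d + (cost[r, c] + cost[nr, nc]) / 2.0 * length
                    if nd < dist[nr, nc]:
                        dist[nr, nc] = nd
                        heapq.heappush(heap, (nd, nr, nc))
    return dist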
ChrisSnyder
Regular Contributor III
if so, it might be more efficient to write your own code


I thought about that some, but... I think that's why my employer pays ESRI like 100's of 1000's of $$$ a year... So I don't have to (or at least shouldn't have to!)... Sigh...

Anyway, I've finally got this thing down now so that what used to take 2.5 weeks in v9.3 (single process, 6-year-old machine) now takes ~6 hours (with 15 concurrent SA subprocesses running on a MUCH faster machine). So I'm content for the time being...