Arcpy Multiprocessor issues

06-27-2011 10:02 PM
StacyRendall1
Occasional Contributor III
Hi,

I am currently writing an arcpy script that churns through some large datasets and extracts data; to make this quicker I figured it might be worth using something like Parallel Python or the inbuilt Multiprocessing library. Ideally I require the script to be run from within Arc (well... there seems to be an inescapable Schema Lock problem using direct iterators within the script, which can be circumvented by using modelbuilder iterators that run the script once per iteration, but that is another story).

However, while I can get a script using either method to run fine outside of Arc (still calling arcpy), when running from within Arc the processing job gets part way through before failing:

  • With Parallel Python it gets to the point where job_server = pp.Server() is defined, then opens another instance of ArcMap if the script is being run from ArcMap, or Catalog if running from Catalog. From Catalog the script sits waiting (progress bar doing its swishy thing, but no processing occurring) until this new instance is closed, at which point the script fails with the following:
    <class 'struct.error'>: unpack requires a string argument of length 8
    Failed to execute (scriptPP).
    When running from ArcMap, the new instance initially shows an error dialog stating "Could not open the specified file"; after clicking OK it proceeds as above (same error as from Catalog, only occurring after the new ArcMap instance is closed). This usually also causes AppROT to fail.


  • With Multiprocessing it gets as far as actually running the module, and again attempts to open a new instance of whichever program it was started from. If running from ArcMap it opens an instance for each process initiated, each of which starts with an error dialog stating "Could not find file: from multiprocessing.forking import main; main().mxd"; upon clicking OK the instance opens as normal. Upon closing all the new instances, the initial ArcMap becomes unresponsive (sometimes the processing dialog is still there, swishing away, sometimes it has disappeared), although there is no error message.

  • Running from Catalog, it brings up an error dialog stating that "C:\Program Files (x86)\ArcGIS\Desktop10.0\Bin\from multiprocessing.forking import main; main() was not found." once for each instance. Then the progress dialog and ArcCatalog both become unresponsive...


I have created a simple Python script using multiprocessing that reproduces the error; it runs fine when run directly, but causes the errors described above when run from Arc (see below). Is anyone else familiar with these issues? Can they be confirmed on other setups? It appears that Arc is somehow trying to open parts of the modules that the processes are using, but I am not sure why. Any help would be greatly appreciated!

import arcpy
import multiprocessing

def simpleFunction(arg):
    return arg*3

if __name__ == "__main__":
    arcpy.AddMessage(" Multiprocessing test...")
    jobs = []
    pool = multiprocessing.Pool()
    for i in range(10):
        job = pool.apply_async(simpleFunction, (i,))
        jobs.append(job)

    for job in jobs: #collect results from the job server
        print job.get()

    del job, jobs
    arcpy.AddMessage(" Complete")
    print "Complete"


ArcGIS Desktop 10.0 SP2, ArcInfo licence, running on Windows 7 with a Core i7 and 8 GB RAM.
15 Replies
ChrisSnyder
Regular Contributor III
Wow - Very nice! I'd have to take a while to digest all that...

I agree, there is a significant amount of overhead in splitting things up, managing the separate processes (multiprocessing seems to make this easier), and then squishing it all back together. Not for the faint-hearted...

If you are interested, here's my stone-age attempt at parallel processing using os.spawnv (this code is almost 5 years old now!). I have a large process that I run every 3 months or so for my bosses - just a big overlay of our land base (riparian areas, forest inventory, harvest areas, habitat concerns, etc.) that we use for reporting, forest modeling, etc. Originally, due to the size of all the layers involved, the union wouldn't run at all, so I created a "tile" feature class (composed of ~40 little rectangular polygons) and would run the overlay for each tile - one at a time. Then I figured, hey, I could run many tiles at once (I just had to write some code to regulate the timing of it all). Instant parallel process! Anyway, it's pretty stone age (it uses .txt files to communicate!), but it works. The "run_union_slave" script does the actual geoprocessing; this "master" script below just manages the overall parallel process.

# Description
# -----------
# This master script regulates the execution of the script run_union_slave.py (for each tile)
# Author: Chris Snyder, WA Department of Natural Resources, chris.snyder(at)wadnr.gov

import sys, string, os, shutil, time, traceback, glob
try:
    #Defines some functions
    def showPyMessage():
        try:
            print >> open(logFile, 'a'), str(time.ctime()) + " - " + str(message)
            print str(time.ctime()) + " - " + str(message)
        except:
            pass 
    def launchProcess(tileNumber):
        global message
        global numberOfProcessors
        message = "Processing tile #" + str(tileNumber); showPyMessage()
        #Added this to address bug with indexTiles.shp used in the overlay tools in v9.2+ (can't run more than 1 overlay process at once)
        newTempDir = r"C:\temp\gptmpenvr_" + time.strftime('%Y%m%d%H%M%S')
        os.mkdir(newTempDir)
        os.environ["TEMP"] = newTempDir
        os.environ["TMP"] = newTempDir
        message = "Set TEMP and TMP variables to " + newTempDir; showPyMessage()
        #End bug fix...
        parameterList = [pythonExePath, slaveScript, root, str(tileNumber), yyyymmdd]
        if tileNumber <= indxPolyFeatureCount:
            if tileNumber == indxPolyFeatureCount or numberOfProcessors == 1:
                message = "Waiting for tile #" + str(indxPolyFeatureCount) + " to finish..."; showPyMessage()
                os.spawnv(os.P_WAIT, pythonExePath, parameterList)
            else:
                os.spawnv(os.P_NOWAIT, pythonExePath, parameterList)
        time.sleep(1); message = "Waiting a few seconds"; showPyMessage()
    
    #Specifies the root variable, makes the logFile variable, and does some error checking...
    dateTimeStamp = time.strftime('%Y%m%d%H%M%S')
    root = sys.argv[1] #r"E:\1junk\overlay_20060531"
    yyyymmdd = sys.argv[2] #this is passed from the master script since the tiles may be processed over a 2+ day period
    if os.path.exists(root)== False:
        print "Specified root directory: " + root + " does not exist... Bailing out!"
        sys.exit()
    scriptName = sys.argv[0].split("\\")[len(sys.argv[0].split("\\")) - 1][0:-3] #Gets the name of the script without the .py extension  
    logFile = root + "\\log_files\\" + scriptName + "_" + dateTimeStamp[:8] + ".log" #Creates the logFile variable
    if os.path.exists(root + "\\log_files") == False: #Makes sure log_files exists
        os.mkdir(root + "\\log_files")
    if os.path.exists(logFile)== True:
        os.remove(logFile)
        message = "Deleting log file with the same name and datestamp... Recreating " + logFile; showPyMessage()
    workspaceDir = root + "\\index_tiles"
    if os.path.exists(workspaceDir)== True:
        message = "Overlay directory: " + workspaceDir + " already exist... Deleting and recreating " + workspaceDir; showPyMessage()
        shutil.rmtree(workspaceDir)
    else:
        message = "Creating index tiles directory: " + workspaceDir; showPyMessage()
    os.mkdir(workspaceDir)
        
    #Process: Finds python.exe
    pythonExePath = ""
    for path in sys.path:
        if os.path.exists(os.path.join(path, "python.exe")) == True:
            pythonExePath = os.path.join(path, "python.exe")
    if pythonExePath != "":
        message = "Python.exe file located at " + pythonExePath; showPyMessage()
    else:
        message = "ERROR: Python executable not found! Exiting script...."; showPyError(); sys.exit()

    #Determines the number of processors on the machine...
    numberOfProcessors = int(os.environ.get("NUMBER_OF_PROCESSORS"))
    maxNumberOfProcessors = 3 #The maximum number of processors you want to use
    if numberOfProcessors > maxNumberOfProcessors:
        numberOfProcessors = maxNumberOfProcessors

    #Determines the number of tiles in the index layer (tile numbers are assumed to be sequential)
    #Note: I didn't use gp.getcount so that this script wouldn't use the gp at all (and eat up more memory)
    indxPolyCountFile = glob.glob(root + "\\log_files\\therearethismanytiles_*.txt")
    if len(indxPolyCountFile) == 0:
        message = "Can't find index polygon count .txt file in " + root + "\\log_files" + "! Exiting script..."; showPyMessage()
        sys.exit()
    indxPolyFeatureCount = int(indxPolyCountFile[0].split("_")[-1][0:-4])
    if indxPolyFeatureCount == 0:
        message = "Index polygon count is 0! Exiting script..."; showPyMessage()
        sys.exit()
    
    tileNumber = 1
    numberOfProcesses = 0
    slaveScript = r"\\Snarf\am\div_lm\ds\gis\ldo\current_scripts\run_union_slave_v93.py"

    tilesThatAreProcessingList = []
    tilesThatFinishedList = []
    tilesThatBombedList = []
    
    #Process: This while loop initially launches <numberOfProcessors> instances of the slave script
    while numberOfProcesses < numberOfProcessors:
        numberOfProcesses = numberOfProcesses + 1
        tilesThatAreProcessingList.append(tileNumber)
        launchProcess(tileNumber)
        tileNumber = tileNumber + 1
    #Process: This while loop checks for tiles that finished or bombed, if there are, new slave scripts are launched
    while len(tilesThatBombedList) + len(tilesThatFinishedList) < indxPolyFeatureCount:
        isDoneList = glob.glob(root + "\\log_files\\" + slaveScript.split("\\")[len(slaveScript.split("\\")) - 1][0:-3] + "_isalldone_*") #makes a list of the .txt file created by the slave script
        doneBombedList = glob.glob(root + "\\log_files\\" + slaveScript.split("\\")[len(slaveScript.split("\\")) - 1][0:-3] + "_bombed_*")
        if len(isDoneList) == 0 and len(doneBombedList) == 0: #if there are no .txt files, wait for x number of seconds
            time.sleep(5)
        else: #else if there are .txt files...
            if len(isDoneList) > 0: #if there are tiles that are "_isalldone_"
                for isDoneItem in isDoneList: #for each .txt file indicating completion...
                    tileThatIsDone = int(isDoneItem[isDoneItem.rfind("_") + 1:isDoneItem.rfind(".")])
                    os.remove(root + "\\log_files\\" + slaveScript.split("\\")[len(slaveScript.split("\\")) - 1][0:-3] + "_isalldone_" + str(tileThatIsDone) + ".txt")
                    message = "Tile #" + str(tileThatIsDone) + " is done!!!"; showPyMessage()
                    tilesThatFinishedList.append(tileThatIsDone)
                    tilesThatAreProcessingList.remove(tileThatIsDone)
                    tilesThatAreProcessingList.append(tileNumber)
                    launchProcess(tileNumber)
                    tileNumber = tileNumber + 1
                    time.sleep(5)
            if len(doneBombedList) > 0: #if there are tiles that "_bombed_"   
                for doneBombedItem in doneBombedList: #for each .txt file indicating failure...
                    tileThatBombed = int(doneBombedItem[doneBombedItem.rfind("_") + 1:doneBombedItem.rfind(".")])
                    os.remove(root + "\\log_files\\" + slaveScript.split("\\")[len(slaveScript.split("\\")) - 1][0:-3] + "_bombed_" + str(tileThatBombed) + ".txt")
                    message = "Tile #" + str(tileThatBombed) + " bombed!!!"; showPyMessage()
                    tilesThatBombedList.append(tileThatBombed)
                    tilesThatAreProcessingList.remove(tileThatBombed)
                    tilesThatAreProcessingList.append(tileNumber)
                    launchProcess(tileNumber)
                    tileNumber = tileNumber + 1
                    time.sleep(5)
            message = "Tiles that are currently processing: " + str(tilesThatAreProcessingList); showPyMessage()
            message = "Tiles that are done: " + str(tilesThatFinishedList); showPyMessage()
            message = "Tiles that bombed: " + str(tilesThatBombedList); showPyMessage()
            time.sleep(5)

    if len(tilesThatBombedList) > 0:
        message = "ERROR - these tiles failed to process: " + str(tilesThatBombedList); showPyMessage()
        sys.exit()
        
    message = "ALL DONE!"; showPyMessage()
    print >> open(root + "\\log_files\\" + scriptName + "_isalldone.txt", 'a'), scriptName + "_isalldone.txt"
except:
    message = "\n*** LAST GEOPROCESSOR MESSAGE (may not be source of the error)***"; showPyMessage()
    message = "\n*** PYTHON ERRORS *** "; showPyMessage()
    message = "Python Traceback Info: " + traceback.format_tb(sys.exc_info()[2])[0]; showPyMessage()
    message = "Python Error Info: " +  str(sys.exc_type)+ ": " + str(sys.exc_value) + "\n"; showPyMessage()
ChristianKienholz
New Contributor
Thanks for sharing your examples and experiences.
I have attempted to parallelize one of my scripts using the multiprocessing module. I split the data beforehand; concretely, I have a separate data/results folder for every parallel process / core. I also have a Python module that does the parallel part. This parallel-processing module contains a couple of different arcpy functions and also cursors.
Now here's the issue: if I use only one core
pool = multiprocessing.Pool(1)
then everything works fine all the time. Well, that is not parallel processing...
As soon as I use more than one core, the processing is not stable any more. E.g. using
pool = multiprocessing.Pool(2)
I get errors in maybe 30% of the runs (I run the code with exactly the same data and exactly the same settings). The errors cannot be reproduced consistently, but they are always thrown by arcpy functions. One of the errors thrown today: 010088 Invalid input geodataset (Layer, Tin, etc.).
The more cores I use, the more often errors occur. Sometimes Python even crashes ("python.exe has stopped working"). Using 4 cores, I get errors in essentially every run.
Has anybody run into similar problems? My parallel parts never access the same data, so I'm a little confused. Thanks for your help. Chris
ChrisSnyder
Regular Contributor III
As soon as I use more than one core, the processing is not stable any more


Are you sure you aren't running out of memory? I have noticed that a lot of the geoprocessing tools in ArcGIS v10 are far more "memory hungry" than in v9.3 (SelectByLocation and SpatialJoin being some examples), which is probably why these tools in v10 appear to run faster...

When a python.exe process that has loaded arcpy runs out of memory while running a geoprocessing task, it tends to throw rather obscure and unexpected errors in tools that otherwise (when run by themselves) run just fine.

I think many times these seemingly random failures are actually just a reflection of when the individual processes have gone over their memory limits.

I haven't tried it myself, but if you have gobs of memory available, you can configure ArcGIS v10.0 to use more memory than the conventional 2.1 GB (which is actually more like 1.4 GB with overhead) - up to ~4 GB of RAM if running a 64-bit OS. An excellent post on the topic: http://forumsstg.arcgis.com/threads/26863-Memory-amp-CPU-settings-for-ArcGIS-10?p=150304&viewfull=1#..., which I am probably going to get around to trying this week.
ChrisSnyder
Regular Contributor III
I finally got around to rewriting a pseudo-parallel processing framework that uses the subprocess module instead of the old-fashioned os.spawnv approach. I couldn't get the multiprocessing module to work well for me, so I wrote my own code to do the same sort of thing: regulate parallel processes. Note that I am not claiming this code is superior to the multiprocessing module in any way (quite the opposite). However, since I wrote it, I understand it, and that is priceless for me! I hope that others might find it useful. I tried to provide useful comments in the code.

This is an updated (and frankly MUCH better) version of the code I had posted here: http://forums.arcgis.com/threads/33602-Arcpy-Multiprocessor-issues?p=116611&viewfull=1#post116611

Note that the example parent and child scripts below don't use any ESRI objects, but they certainly could. In the parent script, all the subprocesses are conveniently tracked in a dictionary of the form:
jobDict[jobId] = [applicationExePath, childScriptPath, [list of input variables], status {"NOT_STARTED", "IN_PROGRESS", "SUCCEEDED", "FAILED"}, subprocess.Popen object]

If you can make this code better in some way or have an idea to make it better, please post it!

PARENT SCRIPT EXAMPLE:
#Import some modules
import os, random, subprocess, sys, time 

#Process: Define a function that will launch the processes and update the job dictionary accordingly
def launchProcess(jobId):
    global jobDict #make this a global so the function can read it
    inputVar1 = jobDict[jobId][2][0] #Input variables are being read from the job dictionary
    inputVar2 = jobDict[jobId][2][1]
    jobDict[jobId][4] = subprocess.Popen([jobDict[jobId][0], jobDict[jobId][1], str(inputVar1), str(inputVar2)], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    jobDict[jobId][3] = "IN_PROGRESS" #Indicate the job is 'IN_PROGRESS'

#Determine how many processes to run concurrently
numberOfProcessorsToUse = 3 #The number of processors you (the user) want to use
if numberOfProcessorsToUse > int(os.environ.get("NUMBER_OF_PROCESSORS")):
    numberOfProcessorsToUse = int(os.environ.get("NUMBER_OF_PROCESSORS"))

#Populate a "job dictionary" to keep track of all the subprocess jobs and all their various inputs and outputs
childScriptPath = r"C:\csny490\simple_subprocess_child.py"
jobDict = {}
for jobId in range(1,21): #this loop just shows how you can populate the jobIds and their input variables - in this example there will be 20 jobs
    inputVar1 = random.randrange(1,6) #this variable tells each subprocess how many seconds it will "sleep" for - between 1 and 5 seconds
    inputVar2 = random.randrange(2,9) #in the child process script, if inputVar2 > 6 it will throw an exception and the subprocess will be 'FAILED' - this simply creates the possibility of failed subprocesses
    #Format of jobDict[jobId] = [applicationExePath, childScriptPath, [list of input variables], status {"NOT_STARTED", "IN_PROGRESS", "SUCCEEDED", "FAILED"}, subprocess.Popen object]
    jobDict[jobId] = [os.path.join(sys.prefix, "python.exe"), childScriptPath, [inputVar1, inputVar2], "NOT_STARTED", None]

#Process: Kick off some processes, monitor the processes, and start new processes as others finish
kickOffFlag = False #Indicates the initial process kick off has not yet occurred
while len([i for i in jobDict if jobDict[i][3] in ("SUCCEEDED","FAILED")]) < len(jobDict):
    if kickOffFlag == False:
        while len([i for i in jobDict if jobDict[i][3] != 'NOT_STARTED']) < numberOfProcessorsToUse:
            launchProcess([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED'][0]) #Feed the appropriate jobId to the launchProcess() function
        kickOffFlag = True #Set the flag to True once we have done the initial kickoff
    for jobId in [i for i in jobDict if jobDict[i][3] == 'IN_PROGRESS' and jobDict[i][4].poll() != None]: #if a subprocess is listed as 'IN_PROGRESS' and poll() no longer returns None (i.e. it is done, but success or failure is still unknown)
        if jobDict[jobId][4].returncode == 0: #return code of 0 indicates success (no sys.exit(1) command encountered in the child process)
            jobDict[jobId][3] = "SUCCEEDED"
        if jobDict[jobId][4].returncode > 0: #return code of 1 (or another non-zero integer) indicates failure (a sys.exit(1) command was encountered in the child process)
            jobDict[jobId][3] = "FAILED"
        if len([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED']) > 0: #if there are still jobs flagged 'NOT_STARTED', launch the next one in line
            launchProcess([i for i in jobDict if jobDict[i][3] == 'NOT_STARTED'][0])
        print "--------------------------------------"
        for statusJobId in jobDict: #use a separate variable so we don't clobber the outer loop's jobId
            print "Job ID " + str(statusJobId) + " = " + str(jobDict[statusJobId][3])
    time.sleep(1) #pause briefly so the polling loop doesn't peg a CPU
   
   
CHILD SCRIPT EXAMPLE:
try:
    import sys, time
    var1 = int(sys.argv[1])
    var2 = int(sys.argv[2])
    print "Nothing to do, so sleeping for " + str(var1) + " seconds..."
    time.sleep(var1)
    if var2 > 6:
        test = "oh crap" + 1 #concatenating a string an int here is intended to throw an exception on purpose to demonstrate a failure
    print "Epic Success!"
    #Note: If script runs through without error, return code will be, by default, 0 - indicating success
except:
    print "Epic Fail!" #Note: You can gather/parse all these print messages in the parent scrip using jobDict[jobId][4].communicate()
    sys.exit(1)  #Note: If script fails, return code is 1 - indicating failure - you can set the return code to be > 0 if you want...
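
For example, a rough sketch of gathering those messages in the parent once a job has polled as finished (using the jobDict layout above):

#Sketch only: grab the child's stdout/stderr text after the job has finished
stdOutText, stdErrText = jobDict[jobId][4].communicate()
print "Job " + str(jobId) + " reported: " + stdOutText
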
ChrisSnyder
Regular Contributor III
If anyone uses any of that code above, I found a bit of a bug...

Basically, if your child subprocess is quite chatty (like doing a bunch of logging to a text file), the subprocess.stdout buffer in memory can fill up and cause the child subprocess to hang forever! If stdout is sent to PIPE and produces more than 65536 characters, the subprocess will hang.

After pulling my hair out for a few hours, I found the answer here: http://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

Some work arounds:
--------------------
1. Don't have a chatty subprocess script (the stdout memory buffer is only 65536 characters or so, which can fill up fast).

2. Don't use stdout = PIPE if you don't care about the stdout messages. This is what I did in my case, because all my parent script needs is the subprocess's .poll() and .returncode.
So for example, I did this:
subprocess.Popen([jobDict[jobId][0], jobDict[jobId][1], str(inputVar1), str(inputVar2)], shell=False)
instead of this:
subprocess.Popen([jobDict[jobId][0], jobDict[jobId][1], str(inputVar1), str(inputVar2)], shell=False, stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

3. Send stdout to an actual file on disk instead of memory... all the "file-like" stdout methods (like .readlines) are the same anyway! File or memory - doesn't really matter; see the sketch below.
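
For example, a rough sketch of option #3, adapted from the launchProcess() function above (the log file path is just an example):

#Sketch only: send the child's output to a log file on disk instead of a memory PIPE
stdOutFile = open(r"C:\temp\job_" + str(jobId) + "_stdout.txt", 'w')
jobDict[jobId][4] = subprocess.Popen([jobDict[jobId][0], jobDict[jobId][1], str(inputVar1), str(inputVar2)], shell=False, stdout=stdOutFile, stderr=stdOutFile)
#...later, once the job polls as done, close stdOutFile and read the text file rather than calling .communicate()
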
PhilipThiem
New Contributor
Regarding the original poster's comment (yes, I realize this is dated) about not being able to use multiprocessing from inside Arc programs: I too had a similar problem and managed to get something working.

Under Windows, Python does not (cannot?) make use of a fork()-type API, which would make the whole setup for multiprocessing much more straightforward. Instead, Python first pickles the Process (multiprocessing.process.Process) object (which contains a function reference and arguments) and certain parts of the sys module. Then the process executable (assumed to be a Python interpreter) is invoked with '-c "from multiprocessing.forking import main; main()" --multiprocessing-fork' to start the bootstrap process. Obviously, this is a problem, because for an in-process tool the process executable is NOT python.exe - it is ArcMap.exe or ArcCatalog.exe. Fortunately, the multiprocessing API provides a workaround:

PYTHON_EXE = os.path.join(sys.exec_prefix, 'pythonw.exe') #if you use python.exe you get a command window
multiprocessing.set_executable(PYTHON_EXE)


However, this would only work with certain scripts under ArcMap/Catalog. Notably, the script would crash if I tried to access arcpy.env.*, and arcpy.env.keys() was empty. So something was up with arcpy. The remainder of the multiprocessing bootstrap is fairly "dumb": it just loads stuff from pipes, restores parts of sys, unpickles, and goes. Generally, modules are not pickled; in fact, if you make a simple empty module and run pickle.dumps(module) on it, you get:

TypeError: can't pickle module objects


Modules will, however, be included in the pickle as a reference that says "load this module on unpickle." This means that every referenced module must be findable via the current sys.path. As path problems were not the issue here, there had to be a problem with the initialization of the arcpy module. There are six OS environment variables (from os.environ) that affect the initialization of arcpy, and they all start with "GP_":

  • GP_ENVPID_PATH

  • GP_ERRPID_PATH

  • GP_LICENSE_CODE

  • GP_PIPE_NAME

  • GP_PROXY_OPS

  • GP_VALIDATE_SCRIPT

These are used to pass information to out-of-process tools. For in-process tools they are blank, and for command-line scripts they don't exist. The problem is that, by default, the child process inherits the OS environment variables of its parent. If multiprocessing is used from an in-process script tool, these environment variables will exist and be empty, so arcpy will think it is in-process when it really isn't. For an out-of-process tool it really is out of process, though I'm not sure whether the environment that gets inherited is the Arc environment at the time of subprocess creation or at OOP tool creation.

Finally, if an alternative OS environment omitting these variables is passed to the API function that creates the subprocess, it works as if it were a command-line script. But... I think this also means you might be checking out another license. I can't seem to get arcpy to populate these variables unless I run an OOP script, and then they only exist for the OOP tool, so I suspect that if a license is checked out, one would be checked out for each process. But maybe not.

There would be two possible ways to remove these variables. One might be able to just delete them from os.environ and then use multiprocessing, which might not be smart as it may affect the parent process. The second possibility would be to make a copy of os.environ, remove the variables, and pass that copy to the new process. However, for the second approach some of the multiprocessing objects need to be modified.

multiprocessing.process's Process object would need to be sub-classed to override start() so it can use a custom Popen object.
multiprocessing.forking's Popen object would need to be sub-classed to do something like

env = dict(os.environ)
for x in list(env.iterkeys()):
    if x.startswith('GP_'):
        del env[x]

hp, ht, pid, tid = _subprocess.CreateProcess(
    _python_exe, cmd, None, None, 1, 0, env, None, None
)


in its __init__(), noting that the call to CreateProcess was changed to take the env variable. If one could emulate the out-of-process setup for a tool, this would be the same place to do it, instead of deleting the GP_* OS environment variables. This was all in 10.0 SP5 under ArcCatalog. If you are using process pools, I suspect you'll need to sub-class the pool object to make use of the custom Process object.

Oddly, if you stay away from certain things (notably getting/setting the arc environments), you can probably use multiprocessing without any modifications. It also looks like some tricks with arcpy.Exists(...) might help, but for my purposes that didn't seem sufficient.
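
For what it's worth, here is a minimal sketch of the simpler first approach (set_executable() plus deleting the GP_* variables from os.environ before any workers are spawned). It is only a sketch: workerModule and doWorkerTask are made-up names, the worker function has to live in a module the child processes can import, and deleting from os.environ carries the caveat above about possibly affecting the parent process.

#Sketch only - intended to be run from an in-process script tool
import os, sys, multiprocessing
import workerModule #hypothetical module containing doWorkerTask(); must be importable via sys.path by the child processes

if __name__ == '__main__':
    #Point multiprocessing at a real Python interpreter instead of ArcMap.exe/ArcCatalog.exe
    multiprocessing.set_executable(os.path.join(sys.exec_prefix, 'pythonw.exe'))
    #Remove the GP_* variables so the children's arcpy doesn't think it is in-process
    for key in [k for k in os.environ.keys() if k.startswith('GP_')]:
        del os.environ[key] #caveat: this also changes the parent's environment
    pool = multiprocessing.Pool(2)
    results = pool.map(workerModule.doWorkerTask, range(10))
    pool.close()
    pool.join()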