Pool from multiprocessing issues

07-26-2011 04:48 PM
BurtMcAlpine
New Contributor III
I am trying to use Pool from the multiprocessing library to speed up an operation, def worker(d), that happens once for each layer in the MXD. This one is hard coded to D:\TEMP\Untitled4.mxd. It runs, but only one at a time. I can see it start the pool, but only one process is being used. Any help would be great. I am running it from ArcToolbox in ArcMap and have unchecked "run in process". Like I said, it runs, but only one at a time...

import arcpy
import os
import multiprocessing

def worker(d):
    # buffer layer by the below values
    bfs = [101, 200, 201, 400, 401, 750, 751, 1001,
           1500, 1501, 2000, 2001, 2500]
    for bf in bfs:
        Output = os.path.basename(d)[:-4] + "_Buffer" + str(bf) + ".shp"
        print "Buffering " + d + " at " + str(bf) + " Feet"
        if arcpy.Exists(d):
            arcpy.Buffer_analysis(d, "D:\\Temp\\" + Output, str(bf) + " Feet")
            arcpy.Project_management("D:\\Temp\\" + Output, "D:\\Temp\\Test\\" + Output, "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Geographic Coordinate Systems\North America\NAD 1983.prj")
            arcpy.Delete_management("D:\\Temp\\" + Output)
        else:
            print "No Data"


if __name__ == '__main__':
   
   #Sets MXD
    mxd = arcpy.mapping.MapDocument("D:\TEMP\Untitled4.mxd")
    #mxd = arcpy.mapping.MapDocument("CURRENT")

    #set some environments needed to get the correct outputs
    arcpy.env.overwriteOutput = True
    arcpy.env.workspace  = "D:\TEMP\Test"
    arcpy.env.outputCoordinateSystem = "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Projected Coordinate Systems\UTM\NAD 1983\NAD 1983 UTM Zone 16N.prj"

    # of processors to use set for max minus 1
    prc = int(os.environ["NUMBER_OF_PROCESSORS"]) - 1

    # Create and start a pool of worker processes
    pool = multiprocessing.Pool(prc)

    # Gets all layers in the Current MXD
    lyrs = arcpy.mapping.ListLayers(mxd)

    #Loops through every layer and gets source data name and path
    for lyr in lyrs:
        d = lyr.dataSource
        print "Passing " + d + " to processing pool"
        arcpy.AddMessage("Passing " + d + " to processing pool")
        pool.apply_async(worker(d))
14 Replies
StacyRendall1
Occasional Contributor III
Burt, a few things:

  • To make your code easier to view, when you paste it, select all your code and click the # button at the top; this will present it as code, keeping your indentation.



  • Try printing prc and lyrs before the multiprocessing loop, with arcpy.AddMessage(str(prc)+' - '+str(len(lyrs))+' - '+str(lyrs)), so you can check they look like you would expect...



  • I don't really know exactly how the multiprocessing module works, but I have been using it a bit. In my experience, I think you need to add the jobs to a job server (just a Python list) and get a result from each worker. I have modified your code below to make each worker return a string, and then print a message telling you when each worker is done.

import arcpy
import os
import multiprocessing

def worker(d):
    # buffer layer by the below values
    bfs = [101, 200, 201, 400, 401, 750, 751, 1001,
           1500, 1501, 2000, 2001, 2500]
    for bf in bfs:
        Output = os.path.basename(d)[:-4] + "_Buffer" + str(bf) + ".shp"
        print "Buffering " + d + " at " + str(bf) + " Feet"
        if arcpy.Exists(d):
            arcpy.Buffer_analysis(d, "D:\\Temp\\" + Output, str(bf) + " Feet")
            arcpy.Project_management("D:\\Temp\\" + Output, "D:\\Temp\\Test\\" + Output, "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Geographic Coordinate Systems\North America\NAD 1983.prj")
            arcpy.Delete_management("D:\\Temp\\" + Output)
        else:
            print "No Data"

    # if it worked, return success
    resultString = 'Layer complete: %s' % (d)

    return resultString


if __name__ == '__main__':

    #Sets MXD
    mxd = arcpy.mapping.MapDocument("D:\TEMP\Untitled4.mxd")
    #mxd = arcpy.mapping.MapDocument("CURRENT")

    #set some environments needed to get the correct outputs
    arcpy.env.overwriteOutput = True
    arcpy.env.workspace = "D:\TEMP\Test"
    arcpy.env.outputCoordinateSystem = "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Projected Coordinate Systems\UTM\NAD 1983\NAD 1983 UTM Zone 16N.prj"

    # of processors to use set for max minus 1
    prc = int(os.environ["NUMBER_OF_PROCESSORS"]) - 1

    # Create and start a pool of worker processes
    pool = multiprocessing.Pool(prc)

    # Gets all layers in the Current MXD
    lyrs = arcpy.mapping.ListLayers(mxd)

    # print info of number of processors, number of layers, and the layer list
    arcpy.AddMessage(str(prc) + ' - ' + str(len(lyrs)) + ' - ' + str(lyrs))

    # start job server
    jobs = []

    #Loops through every layer and gets source data name and path
    for lyr in lyrs:
        d = lyr.dataSource
        print "Passing " + d + " to processing pool"
        arcpy.AddMessage("Passing " + d + " to processing pool")

        # add worker to job server
        jobs.append(pool.apply_async(worker(d)))

    # get 'results' from each worker in job server
    for job in jobs:
        arcpy.AddMessage(job.get())


Let me know how you get on.
BurtMcAlpine
New Contributor III
Stacy,

Thanks for the help. I worked on this for two days off and on, as I can benefit in several ways from having a process for every layer in an MXD. I gave up on mp and used the pp (Parallel Python) library. I was using your project as an example: http://pythongisandstuff.wordpress.com/author/stacyrendall/. I also had an install issue with ArcView, which compounded my problems. The code below, which I just finished, buffers the layers in a saved MXD. I ran it as a single thread and it took 420 seconds. Using the pp library it took 151 seconds from Python and 154 as a tool, so a significant improvement. I never could get the jobs to start using mp; now that I have some understanding of pp I think I will stick with it.

I tested your modification. It still runs as a single process... Thanks for helping; your web page was very useful.

import arcpy
import os
import pp
import time

#forces script to run out of process no matter what
import win32com.client
gp = win32com.client.Dispatch("esriGeoprocessing.GpDispatch.1")


def wrker(d,bf):    #pp module: everything in here gets run in multiple processes
                    #much faster than a single thread

    # DO NOT USE arcpy.AddMessage() OR arcpy.AddError() IN THIS PART OF THE SCRIPT.
    #set environments for each pp process
    arcpy.env.overwriteOutput = True
    arcpy.env.workspace  = "D:\TEMP\Test"
    arcpy.env.outputCoordinateSystem = "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Projected Coordinate Systems\UTM\NAD 1983\NAD 1983 UTM Zone 16N.prj"

    # Buffer distance in Feet
    distance = str(bf) + " Feet"

    #output file will go into the Workspace from above
    Output = os.path.basename(d)[:-4] + "_Buffer" + str(bf) + ".shp"

    #no .shp at the end of an In_memory feature
    Tempfile = "In_memory\\" + Output[:-4]

    Jobdone = d + " buffered by " + distance
    Jobfail = d + " failed to buffer by " + distance
    if arcpy.Exists(d):
        arcpy.Buffer_analysis(d, Tempfile, distance)
        arcpy.Project_management(Tempfile, Output, "C:\Program Files (x86)\ArcGIS\Desktop10.0\Coordinate Systems\Geographic Coordinate Systems\North America\NAD 1983.prj")
        arcpy.Delete_management(Tempfile)
        #print "Buffering " + d + " at " + distance
    else:
        print Jobfail

    #clear the Tempfile reference; the In_memory data was already deleted above so it is not holding RAM
    del Tempfile

    #data returned to main code and used in messaging there
    return Jobdone

if __name__ == '__main__':

    #starts clock
    clck_st = time.clock()

    #Sets MXD; cannot use CURRENT for pp processing
    mxd = arcpy.mapping.MapDocument("D:\TEMP\Untitled4.mxd")

    #Gets all layers in the Current MXD
    lyrs = arcpy.mapping.ListLayers(mxd)

    #PP prep
    # of processors to use set for max minus 1
    prc = int(os.environ["NUMBER_OF_PROCESSORS"]) - 1

    #using Parallel Python
    ppservers=()

    #sets workers to prc
    job_server = pp.Server(prc, ppservers=ppservers)
    #sets workers to the max available
    #job_server = pp.Server(ppservers=ppservers)

    print "Processing all Layers in the mxd"
    arcpy.AddMessage("Processing all Layers in the mxd")

    #list to hold all of the jobs
    jobs = []

    #List of buffer values
    bfs = [101, 200, 201, 400, 401, 750, 751, 1001,
           1500, 1501, 2000, 2001, 2500]

    #Loops through every layer and gets source data name and path
    for lyr in lyrs:
        d = lyr.dataSource
        arcpy.AddMessage("Passing " + d + " to processing pool")
        for bf in bfs:
            jobs.append(job_server.submit(wrker,(d,bf), (), ("arcpy","os","time")))
            #wrker(d)
            msg = "Passing " + d + " buffer at " + str(bf) + " Feet to processing pool"
            print msg
            arcpy.AddMessage(msg)

    # Retrieve results of all submitted jobs
    for job in jobs:
        print job()
        arcpy.AddMessage(job())

    #script is finished
    print "Processing completed in " + str(int(time.clock() - clck_st)) + " seconds"
    arcpy.AddMessage("Processing completed in " + str(int(time.clock() - clck_st)) + " seconds")
    time.sleep(5)
ChrisSnyder
Regular Contributor III
Guess there's only one sure way to tell, but... Do you guys have any ideas on how memory usage is doled out and/or recycled for multiprocessing or Parallel Python? In days of yore many of us were forced to create separate processes (a separate python.exe) using spawnv or subprocess. Primarily, this was simply because the gp object (or arcpy) had some really bad memory leak problems, and any gp-enabled application (python.exe) would gradually run out of memory when running tools in a loop. As an added bonus, you could write additional code to "manage" the separate processes as a pseudo-parallel process. These gp/arcpy memory leak issues have been improved greatly, but are still a significant problem for "heavy duty" geoprocessing...

Anyway, I am curious whether pp or multiprocessing can somehow avoid these memory leak issues. In the pp example, I see that you hand the parallel process the same arcpy object that is instantiated in main. Do all the child processes use (and presumably add to the memory consumption of) the same arcpy instance? In the multiprocessing example, since I don't see such explicit code as in the pp example, I assume the child processes by default simply have access to all the global variables in main?

I am theorizing that pp and/or multiprocessing will still suffer from gradual and catastrophic gp/arcpy memory consumption, but I can't wait to try some stuff out for myself. Thanks a million for putting these examples up!!!
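For anyone who missed that era, the old pattern was roughly this (a from-memory sketch; buffer_one_layer.py and the layer paths are made up for illustration, not real files):

import subprocess
import sys

# hypothetical worker script and layer list - made up for illustration only
layers = [r"D:\Temp\roads.shp", r"D:\Temp\parcels.shp"]

procs = []
for lyr in layers:
    # each call starts a fresh python.exe, so any gp/arcpy memory leaks
    # die with the child instead of piling up in one long-lived process
    procs.append(subprocess.Popen([sys.executable, r"D:\Temp\buffer_one_layer.py", lyr]))

# crude "pseudo-parallel" management: just wait for all the children to finish
for p in procs:
    p.wait()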
BurtMcAlpine
New Contributor III
Chris, in my PP example I use "In_memory", which uses part of your RAM to hold data, to speed up the wrker function, which runs in parallel. Each "job" makes a unique "In_memory" object, so make sure to delete it before the wrker function ends. As for program memory leaks, I assume they are all still there and would be a much bigger issue if the code were run as a single process; you might be able to do a two-part process to clean the memory out halfway through if you run into memory issues, but I have not even come close on my stuff.

The pp code (the only one I got to work) has to be run out of process, so "prc" in my example sets the number of processes to start (5 for my Intel 980X). Each process is a Python process (32-bit) with its own memory allocation, just like using spawn or a subprocess. Then all of the jobs are fed to the pool of 5 processes, and those same 5 processes run through the jobs until all of the jobs are finished.

In my testing I have not had any memory issues. If you use "In_memory", make sure not to load really big objects (like a 1 GB feature class) into it; also there are several limitations as to what can and cannot use "In_memory", but I use it all the time and it can really speed up the processing, sometimes by a factor of four.

The mp Pool "spawns" new processes as well, but I never could get it to work correctly; it always wanted to run everything in a single process. It would start the pool, but only feed one process. PP is free and seems to work well, so that is what I will be using.

Please look at http://pythongisandstuff.wordpress.c.../stacyrendall/ for some known issues with mp and PP; it is towards the bottom of the page.
StacyRendall1
Occasional Contributor III
Burt,

It is intriguing that Multiprocessing isn't working for you; can you confirm that it doesn't have an error, but just runs one process at a time? I use the Multiprocessing library exclusively now, as PP can't seem to handle extensions, but have never had any problems with it...

I don't know much about the specifics of memory use, but I have learned a few things (all with regard to using the Multiprocessing library, but they may shed some light on PP also). The child processes take a job from the job pool, complete it, take the next available job, and so on. However, they seem to keep hold of some (not all) of the memory they were using; if you watch the processes over time the memory footprint of each will grow. Running some models involving Network Analyst with 4 child processes and a worker pool of initially 56 items, I received a MemoryError after the total footprint of all four processes had grown to about 4 GB, about halfway through the pool. This was nowhere near the limits of the computer I was using. My best guess was that all the memory was still owned by the main process, and was thus hitting its 32-bit limit, but the numbers don't really add up to support that theory...

The easy way around this isn't available until Python 2.7, where you can set a maxtasksperchild value on the Pool; although it doesn't seem to work very well, requiring more complicated code which either hangs or runs quite slowly.
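For reference, in 2.7 it is just an extra argument when you create the pool (the documented form, not something I have working well yet):

# Python 2.7+ only: each child process is retired and replaced after it
# has completed this many tasks, releasing whatever memory it was holding
pool = multiprocessing.Pool(prc, maxtasksperchild=10)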

The hard way around is to recreate the pool every n tasks for each child, depending on how big the operations being performed by the workers are... Obviously it isn't pretty, but it works just fine. I have used a number of computers and a mix of Windows 7 and XP while testing all this out, and it seems that sometimes, on some computers, the first parallel processes through the pool are a lot slower than the later ones, by at least 4 to 7x. If this is happening on a computer, you want n as high as possible, to minimise the number of your tasks that are first through the pool. If not, you might as well recreate the pool after each parallel process has run once.
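Roughly, the recreate-the-pool idea looks like this (an untested sketch with a placeholder worker and task list, not my actual model code):

import multiprocessing

def worker(d):
    # placeholder for the real geoprocessing task
    return 'Task complete: %s' % (d)

if __name__ == '__main__':
    tasks = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']   # placeholder job list
    prc = 2   # number of child processes
    n = 2     # tasks per child before the pool is thrown away and rebuilt

    results = []
    # work through the task list in chunks of prc * n, recreating the pool
    # each time so the children (and any memory they are hoarding) are released
    for i in range(0, len(tasks), prc * n):
        pool = multiprocessing.Pool(prc)
        jobs = [pool.apply_async(worker, (t,)) for t in tasks[i:i + prc * n]]
        results.extend([job.get() for job in jobs])
        pool.close()
        pool.join()

    print results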

Let me know if you want more info, and I'll try to get an example up on my blog sooner rather than later!

Cheers.
BurtMcAlpine
New Contributor III
Stacy, correct: the mp code you modified runs the same as the code I originally posted. That is the problem I was having originally; it runs, but only using one child (one process), so really non-parallel. It makes the pool, but only uses one worker to process all of the jobs. I just don't know what is wrong. Is there some setting for mp that I am missing? I would like to use mp, as it seems to be "better" supported in that I can use arcpy.AddError and arcpy.AddMessage. If you make an MXD with a handful of shapefiles (has to be shapefiles for this example) and save it, then point mxd in the code to that MXD, you can test it. You will also need to adjust the workspace, as I am using "D:\Temp". Does it run in parallel on your PC?

In case it matters... I am running Windows 7 64-bit with an Intel processor and 12 GB of RAM, and ArcGIS 10 SP2 with Python 2.6 (as it comes with ArcGIS); most of the patches after SP2 are installed as well.
ChrisSnyder
Regular Contributor III
I received a MemoryError after the total footprint of all four processes had grown to about 4 GB


Questions:

1) Are these individual processes that are uniquely discernible in the Task Manager? (example: 4 individual python.exe (or ???) processes?) Or a single instance?

2) If the latter,  I assume you are using the large address aware setting to get near 4GB (and a 64-bit OS, correct)? If so, have you noticed any drawbacks from using it?
BurtMcAlpine
New Contributor III
1) Each worker in the pool starts its own python.exe process, and you can see them in the Task Manager.

2) I am not using large address aware. Can you confirm that Python 2.6 has that capability? If so, can you point me in that direction?

I do know that ArcGIS 10 is large address aware by default, and that in the next release or two (ArcGIS 10.1 or 11 SP1 is the estimate) ArcGIS will go to a 64-bit GP environment (not the entire application, just the geoprocessing environment). I was told that at the ESRI UC 2011 by an ESRI employee.
ChrisSnyder
Regular Contributor III
Large Address Aware: I just assumed it was, but... for the 32-bit version of Python 2.x (3.x too?), I guess not!
http://bugs.python.org/issue1449496
Looks like Jason S. from ESRI has tried hard to make it so, but...

The non-ESRI people in the forum link said something to the effect of "why don't you just have your wrapper program (aka ArcGIS) be large address aware". So maybe when run via ArcWhatever, Python would "act large address aware"?? Dunno...

"ArcGIS will go to 64-bit GP environment (not the entire application just the Geoprocessing environment) I was told that at the ESRI UC 2011 by an ESRI employee"


Wow! I had "heard" that only Server was going to be native 64-bit, but whatever... The geoprocessing stuff is the only thing I really care about where I think gobs of memory would actually do something cool, so.... Cool!