Issues with multiprocessing and spatial analyst

JamesRamm · ‎02-27-2014

I have read every blog post and thread I can find on multiprocessing with arcpy and none of the fixes in them have fully addressed my problem.

I'm trying to do a relatively simple watershed calculation using multiprocessing.
The 'worker' function looks like this:

def multi_watershed(pnts, branchID, flowdir, flowacc, scratchWks):

    direc = tempfile.mkdtemp(dir = scratchWks) # If called in a pll process, needs to write to seperate directories
    arcpy.env.scratchWorkspace = direc    

    polylist = []
    for i, p in enumerate(pnts):
        pnt = arcpy.PointGeometry(arcpy.Point(p.x, p.y, ID=i))  #Convert the shapely point to an arcpy point
        pourpt = sa.SnapPourPoint(pnt, flowacc, 1000) 
        ws = sa.Watershed(flowdir, pourpt)  
        out = os.path.join(direc, "pol_%i"%i) #Generate a filename for the output polygon
        arcpy.RasterToPolygon_conversion(ws, out)
        polylist.append(out) #Append the output file to the list to be returned
    res = (branchID, polylist)
    return res

Given a list of points, it snaps the point to high flow accumulation, calculates the watershed and converts it to a polygon.
Using one process, this works fine.

I have a dictionary where each value is a list of points and I am trying to do the multiprocessing over this dictionary. The multiprocessing function looks like this:

def watershed_pll(data, flowdir, flowacc, tempfolder, proc=4):
    """ Calculate the watershed for each station point using parallel processing """
    pool = Pool(processes = proc)
    jobs = []
    for key, val in data.iteritems():
        jobs.append(pool.apply_async(multi_watershed, (val, key, flowdir, flowacc, temp)))    
    pool.close()
    pool.join()                   
    return jobs

It is as simple as can be and just returns the list of 'Apply_Result' objects. I then run this function from a script.
When using multiprocessing, sometimes it works, but more often than not I get one of these errors:

ERROR 010088: Invalid input geodataset (Layer, Tin, etc.).]

or

Unable to remove directory. Possible causes:
1- Not owner of the directory
2- Another person or application is accessing this directory

or even

FATAL ERROR(INFADI)
MISSING FILE OR DIRECTORY

There seems to be no pattern as to if/when these errors will occur and which one it will be...

Any ideas?

JamesRamm · ‎02-28-2014

Yes it is a pain. I have a version of the code without multiprocessing which I keep up-to-date with the multiprocessing version to check that I at least do not have any 'normal' bugs.

PeterWilson · ‎07-29-2014

Hi James and Cody

I'm currently busy with my Masters Thesis where I'm developing a geostatistical monte carlo model (conditional sequential guassian simulation) to measure uncertainty within stream networks derived from DEM using Arc Hydro D8 algorithm. My simulation requires a 100 simulations of my study area using Arc Hydro. I'm looking to multi-process the Flow Direction and Flow Accumulation process, but battling to figure out how to cut up the DEM as the following processes require the entire dem to generate the flow direction and flow accumulation. Any ideas or advice would be appreciated.

Regards

CodyScott · ‎07-29-2014

Peter,

Could you clarify the 100 simulations bit.

Are they 100 different versions of the Flow Direction/Accumulation, or is there only one base Flow Direction/Accumulation that your GA model uses?

Taking a guess, if your creating something that does flow direction/accumulation it would seem imperative that having the whole raster would be critical to determining the correct information for flow.

My other guess would be if you could determine where the watersheds are in the area, and break it down into multiple smaller sections you could split the DEM that way potentially.

Hard to say without having an understanding of the data though, so hopefully i can help more in a bit.

Cheers.

PeterWilson · ‎07-29-2014

Hi Cody

I'l try to explain the workflow of the Monte Carlo Simulation.

The Monte Carlo simulation takes a raw hydrological DEM\DTM as input and using statistics generates a new DEM\DTM that represents error, by adjusting the elevation values. The following is repeated a certain amount o times, the more the better to produce a better probability distribution of the error. Each DEM\DTM that is produced from the Monte Carlo simulation is then processed using Arc Hydro Terrain Preprocessing to generate a stream network from the DEM. The Spatial Analyst Hydrology Tools are the same tools (i.e. Flow Direction ; Flow Accumulation ; Stream Definition). My simulation requires that I generate 1000 simulations to produce a new version of the stream network. The 1000 version of the stream network are then compared to quantify where the uncertainty\error is found within the DEM by how many times the stream network is the same for each version and where it differs due to the change in heights introduced by the simulation.

The Flow Direction and Flow Acummulation processes are exceptionally computational and currently the tools don't make use of multiprocessing or the the additional memory available as part of the 64bit architecture. I'm looking for a way to split the DEM\DTM and process it using multiprocessing and stick it back together at the end of the process. As you mentioned if there was a way to identify the location of the catchments before hand one could split the DEM\DTM accordingly.

Regards

JamesRamm · ‎09-26-2014

My answer at the moment would be to look elsewhere than arcpy/arcgis for numerical modelling.

Perhaps arcobject would allow better control, but there are certainly faster algorithms out there that use less memory consumption than arc hydro's offerings.

D8 algorithms are not hard to program yourself and you do not need to look at a 'whole raster' at once. D8 algorithms only look at 9 cells at a time - the current cell and it's neighbours.

The difficulty is in how to manage edge cases.

I did a quick example a while ago for work showing that with python/cython and gdal, you could process large rasters (in a format supported by gdal) far quicker than archydro does it by streaming 3 rows at a time and shipping these out to parallel processes.