Issues with multiprocessing and spatial analyst

JamesRamm — Thu, 27 Feb 2014 08:03:09 GMT

I have read every blog post and thread I can find on multiprocessing with arcpy and none of the fixes in them have fully addressed my problem.

I'm trying to do a relatively simple watershed calculation using multiprocessing.
The 'worker' function looks like this:

import os
import tempfile

import arcpy
from arcpy import sa

def multi_watershed(pnts, branchID, flowdir, flowacc, scratchWks):

    # When called in a parallel process, each call needs to write to a separate directory
    direc = tempfile.mkdtemp(dir=scratchWks)
    arcpy.env.scratchWorkspace = direc

    polylist = []
    for i, p in enumerate(pnts):
        pnt = arcpy.PointGeometry(arcpy.Point(p.x, p.y, ID=i))  # Convert the shapely point to an arcpy point
        pourpt = sa.SnapPourPoint(pnt, flowacc, 1000)
        ws = sa.Watershed(flowdir, pourpt)
        out = os.path.join(direc, "pol_%i" % i)  # Generate a filename for the output polygon
        arcpy.RasterToPolygon_conversion(ws, out)
        polylist.append(out)  # Append the output file to the list to be returned
    res = (branchID, polylist)
    return res

Given a list of points, it snaps the point to high flow accumulation, calculates the watershed and converts it to a polygon.
Using one process, this works fine.

I have a dictionary where each value is a list of points and I am trying to do the multiprocessing over this dictionary. The multiprocessing function looks like this:

from multiprocessing import Pool

def watershed_pll(data, flowdir, flowacc, tempfolder, proc=4):
    """ Calculate the watershed for each station point using parallel processing """
    pool = Pool(processes = proc)
    jobs = []
    for key, val in data.iteritems():
        # tempfolder is the scratch root in which each worker creates its own directory
        jobs.append(pool.apply_async(multi_watershed, (val, key, flowdir, flowacc, tempfolder)))
    pool.close()
    pool.join()
    return jobs

It is as simple as can be and just returns the list of 'ApplyResult' objects. I then run this function from a script.
When using multiprocessing, sometimes it works, but more often than not I get one of these errors:

ERROR 010088: Invalid input geodataset (Layer, Tin, etc.).

or

Unable to remove directory. Possible causes:
1- Not owner of the directory
2- Another person or application is accessing this directory

or even

FATAL ERROR(INFADI)
MISSING FILE OR DIRECTORY

There seems to be no pattern as to if/when these errors will occur and which one it will be...

Any ideas?

Re: Issues with multiprocessing and spatial analyst

DuncanHornby — Thu, 27 Feb 2014 09:29:35 GMT

Just a comment: how are you running this script? I have never been able to get multiprocessing to work in PyScripter (my IDE of choice). A colleague of mine did use multiprocessing successfully but ran it in IDLE.

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Thu, 27 Feb 2014 09:54:40 GMT

Through Spyder, but really the Python console is a separate process, so I doubt Spyder is impacting anything.

I run other operations using arcpy and multiprocessing with no problem - the difference there is that they are only calling one tool; here there are 3 or 4.

My feeling is that Arc is attempting to delete/move data in between operations without my having told it to... although I could be wrong.

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Fri, 28 Feb 2014 09:33:23 GMT

I wonder if this could be one of two potential problems:

1. Environments/folders getting mixed up. The input data is passed as complete filepaths to some raster datasets which are outside of the scratchWorkspace (which is set locally for each process) where intermediate/output data is created. However, I have noticed that Arc may make folders (typically an 'info' folder) in the directories of the input data. Why is that? Can it be prevented, or can I prevent Arc from then trying to delete it?

2. Are there any potential problems with accessing the input raster datasets at the same time? I.e. each process will be attempting to open and read from the flow direction and flow accumulation rasters which are passed into the function.

Re: Issues with multiprocessing and spatial analyst

DuncanHornby — Fri, 28 Feb 2014 09:45:46 GMT

The info directory is an important part of an ESRI raster; you will not be able to prevent its generation. With regards to #2, I just had an idea: if several processes connecting to the same data is the problem, why not duplicate that data under different names? It looks like you are attempting to use 4 cores, so have 4 flow direction grids? Just an idea.
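
The duplication idea could be sketched roughly as below. This is only an illustration under assumptions: `duplicate_inputs` and its parameters are hypothetical names, and the default `shutil.copy2` only suits single-file rasters. For folder-based Esri grids a geodata-aware copy (for example passing `arcpy.Copy_management` as `copy_func`) would presumably be needed instead.

```python
import os
import shutil
import tempfile

def duplicate_inputs(paths, scratch_root, copies, copy_func=shutil.copy2):
    """Make `copies` private sets of the input files, one set per worker.

    Hypothetical helper: returns a list with one entry per worker, each entry
    being the list of copied paths in the same order as `paths`.
    """
    per_worker = []
    for n in range(copies):
        # Each worker gets its own directory, so no two processes share a file
        worker_dir = tempfile.mkdtemp(prefix="inputs_%d_" % n, dir=scratch_root)
        private = []
        for src in paths:
            dst = os.path.join(worker_dir, os.path.basename(src))
            copy_func(src, dst)  # copy_func(source, destination)
            private.append(dst)
        per_worker.append(private)
    return per_worker
```

Each worker would then be handed its own entry from the returned list instead of the shared flow direction and flow accumulation paths. If the errors disappear, shared read access was the likely culprit.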

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Fri, 28 Feb 2014 10:09:04 GMT

Yes that is worth trying; it will at least confirm whether that is the source of the error.

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Sun, 12 Dec 2021 05:15:27 GMT

I believe I have solved the issue. I have just done 6 tests with no error...

However, I have not solved the problem of why the issue occurred.

My original 'worker' function called a number of arcpy commands in sequence:

- first it converted a point to an arcpy point using Point and PointGeometry
- It called the SnapPourPoint tool which output a temporary raster
- It called the Watershed tool, which output another temporary raster
- It called the RasterToPolygon_conversion tool to create a Polygon object

I edited the original code so that the only tool used with parallel processing is the Watershed tool (which is the most time-consuming).

The point conversion and the SnapPourPoint tool are called in their own loop separately and the results stored to disk.

The watershed calculation is then performed using parallel processing and the resulting rasters stored to disk.

The conversion is then performed in a loop on these results. Finally, all those intermediate files

The parallel code now looks like this:

def watershed_pll(data, flowdir, flowacc, tempfolder, proc=4):
    """ Calculate the watershed for each station point using parallel processing """
    
    # Preprocessing of data points for watershed calculation
    pnts = []    
    for key, val in data.iteritems():
        pnts.append(prepare_point(val, flowacc, key, tempfolder))
    pp = dict(pnts)
    
    # Do the watershed calc in parallel
    pool = Pool(processes = proc)
    jobs = []        
    for key, val in pp.iteritems():
        jobs.append(pool.apply_async(pll_watershed, (val, key, flowdir, flowacc, tempfolder)))
    pool.close()
    pool.join()  

    # Convert watershed rasters to polygons
    ws_rast = dict([job.get() for job in jobs])
    res = []
    for key, val in ws_rast.items():
        res.append(af.ws_poly(val, key, mod.temp))
    pols = dict(res)
    return pols

It looks much bigger: there are now 3 loops instead of one. There is also a lot of dictionary handling to preserve the results format/order for the next loop. This is all a bit more expensive but, happily, the Watershed tool is by far the most time-consuming step and it is parallelised, so there is still a significant speed-up over single processing.

Hopefully, the other 2 loops can also be parallelised independently to give another speed up... I'll look into that next.

My only worry is that the intermediate files cannot be deleted on each iteration, as they are needed in the next loop. This could potentially be a problem with big data.

I am also unsure what caused the initial problem in the first place, which would be good to know.

Re: Issues with multiprocessing and spatial analyst

CodyScott — Fri, 28 Feb 2014 16:04:59 GMT

In my experience with multiprocessing and PyScripter I have learned a few things about this.

Like you, I had a nightmare trying to get scripts to complete correctly; they would often power through 200 rasters or so and then simply stop for no reason...

The solution I arrived at in my case (with rasters) was based on finding that if I attempted to write all the rasters to the root folder, it would fail. So I simply had each process generate a new folder with a number on the end, which then became the destination folder.

Once all the rasters have been processed, you can step through the folders and collect them all into a single merged file at the end.
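
A bare-bones sketch of that folder layout might look like the following. The helper names (`job_folder`, `collect_outputs`) and the `job_<n>` naming are made up for illustration, and the merge step itself is left out because it depends on the data.

```python
import glob
import os

def job_folder(root, index):
    """Create and return a private output folder such as <root>/job_3.

    Hypothetical helper: each worker would call this with its own index and
    write its rasters only into the folder it gets back.
    """
    path = os.path.join(root, "job_%d" % index)
    if not os.path.isdir(path):
        os.makedirs(path)
    return path

def collect_outputs(root, pattern="*"):
    """List every output found in the per-job folders under root, sorted by path."""
    return sorted(glob.glob(os.path.join(root, "job_*", pattern)))
```

After the pool has finished, the list returned by `collect_outputs` would be the input to whatever merge step suits the data.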

This was the only solution I could find that worked. The other one was creating "in_memory" versions of the base data each time a process ran. You'll need to manage the data and make sure you delete the files created, otherwise you'll run out of memory.

I can toss the code up if that would be helpful..

This: http://blogs.esri.com/esri/arcgis/2012/09/26/distributed-processing-with-arcgis-part-1/
and this: http://pythongisandstuff.wordpress.com/2013/07/31/using-arcpy-with-multiprocessing-%E2%80%93-part-3/

were also dead useful

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Fri, 28 Feb 2014 16:25:02 GMT

Code would be helpful.

I was already generating a new workspace folder for each process so that is not the problem..

Unfortunately my solution above is not the total answer. I have found that for larger data, it will still fail, but usually with an ArcGIS error code stating that it is unable to execute the tool or something (will have to wait until Monday to get the exact code).

So there is still a problem somewhere...

Re: Issues with multiprocessing and spatial analyst

CodyScott — Fri, 28 Feb 2014 16:31:04 GMT

The other thing I remember from my experience is that debugging multiprocessing code is a pain too.

I couldn't debug when running asynchronously; I had to work the kinks out of the code beforehand, then hope that the multiprocessing run went fine.

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Fri, 28 Feb 2014 18:45:42 GMT

Yes it is a pain. I have a version of the code without multiprocessing which I keep up-to-date with the multiprocessing version to check that I at least do not have any 'normal' bugs.
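
One way to avoid maintaining two copies of the code is a switch that runs the same worker either serially or through a pool. The sketch below is an assumption about how that could look, not what either of us actually ran; `run_jobs` and its parameters are hypothetical names.

```python
from multiprocessing import Pool

def run_jobs(worker, arg_list, parallel=True, processes=4):
    """Call worker(*args) for every tuple in arg_list; return results in order.

    Hypothetical helper. `worker` must be a module-level function so that it
    can be pickled and sent to the pool processes.
    """
    if not parallel:
        # Serial path: errors give a normal traceback and a debugger can step in
        return [worker(*args) for args in arg_list]

    pool = Pool(processes=processes)
    jobs = [pool.apply_async(worker, args) for args in arg_list]
    pool.close()
    pool.join()
    # get() re-raises in the parent any exception raised inside a worker
    return [job.get() for job in jobs]
```

Running with `parallel=False` first checks for 'normal' bugs; flipping the flag then exercises exactly the same worker under multiprocessing.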

Re: Issues with multiprocessing and spatial analyst

PeterWilson — Tue, 29 Jul 2014 19:10:04 GMT

Hi James and Cody

I'm currently busy with my Masters thesis, where I'm developing a geostatistical Monte Carlo model (conditional sequential Gaussian simulation) to measure uncertainty within stream networks derived from a DEM using the Arc Hydro D8 algorithm. My simulation requires 100 simulations of my study area using Arc Hydro. I'm looking to multi-process the Flow Direction and Flow Accumulation steps, but I'm battling to figure out how to cut up the DEM, as these processes require the entire DEM to generate the flow direction and flow accumulation. Any ideas or advice would be appreciated.

Regards

Re: Issues with multiprocessing and spatial analyst

CodyScott — Tue, 29 Jul 2014 20:20:37 GMT

Peter,

Could you clarify the 100 simulations bit.

Are they 100 different versions of the Flow Direction/Accumulation, or is there only one base Flow Direction/Accumulation that your GA model uses?

Taking a guess, if you're creating something that does flow direction/accumulation, it would seem that having the whole raster is critical to determining the correct information for flow.

My other guess would be that if you could determine where the watersheds are in the area and break it down into multiple smaller sections, you could potentially split the DEM that way.

Hard to say without having an understanding of the data though, so hopefully I can help more in a bit.

Cheers.

Re: Issues with multiprocessing and spatial analyst

PeterWilson — Wed, 30 Jul 2014 05:02:01 GMT

Hi Cody

I'll try to explain the workflow of the Monte Carlo simulation.

The Monte Carlo simulation takes a raw hydrological DEM\DTM as input and, using statistics, generates a new DEM\DTM that represents error by adjusting the elevation values. This is repeated a certain number of times; the more repetitions, the better the probability distribution of the error. Each DEM\DTM produced by the Monte Carlo simulation is then processed using Arc Hydro Terrain Preprocessing to generate a stream network from the DEM. The Spatial Analyst Hydrology Tools are the same tools (i.e. Flow Direction; Flow Accumulation; Stream Definition). My simulation requires that I generate 1000 simulations, each producing a new version of the stream network. The 1000 versions of the stream network are then compared to quantify where the uncertainty\error is found within the DEM, by how many times the stream network is the same for each version and where it differs due to the change in heights introduced by the simulation.

The Flow Direction and Flow Accumulation processes are exceptionally computationally intensive, and currently the tools don't make use of multiprocessing or the additional memory available as part of the 64-bit architecture. I'm looking for a way to split the DEM\DTM, process it using multiprocessing and stick it back together at the end of the process. As you mentioned, if there was a way to identify the location of the catchments beforehand, one could split the DEM\DTM accordingly.

Regards

Re: Issues with multiprocessing and spatial analyst

JamesRamm — Fri, 26 Sep 2014 10:01:05 GMT

My answer at the moment would be to look elsewhere than arcpy/arcgis for numerical modelling.

Perhaps ArcObjects would allow better control, but there are certainly faster algorithms out there that use less memory than Arc Hydro's offerings.

D8 algorithms are not hard to program yourself and you do not need to look at a 'whole raster' at once. D8 algorithms only look at 9 cells at a time: the current cell and its neighbours.

The difficulty is in how to manage edge cases.

I did a quick example a while ago for work showing that with Python/Cython and GDAL, you could process large rasters (in a format supported by GDAL) far quicker than Arc Hydro does, by streaming 3 rows at a time and shipping these out to parallel processes.
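
To give a flavour of the three-rows-at-a-time idea, here is a minimal pure-Python sketch of the flow direction step only. It is not the Cython/GDAL code mentioned above and does no parallel work. The function names are made up, the direction codes follow the Esri convention (1 = east, doubling clockwise to 128 = north-east), and two simplifications are assumed: neighbours outside the raster are ignored, and sinks or flat cells simply get code 0.

```python
import math

# Esri-style D8 codes keyed by (row offset, column offset) of the neighbour
D8_CODES = {
    (0, 1): 1,     # east
    (1, 1): 2,     # south-east
    (1, 0): 4,     # south
    (1, -1): 8,    # south-west
    (0, -1): 16,   # west
    (-1, -1): 32,  # north-west
    (-1, 0): 64,   # north
    (-1, 1): 128,  # north-east
}

def d8_row(above, current, below):
    """Return flow direction codes for `current`, given its neighbouring rows.

    `above` or `below` is None at the top or bottom edge of the raster.
    A cell with no strictly lower neighbour gets 0 (assumed simplification).
    """
    rows = {-1: above, 0: current, 1: below}
    out = []
    for col, z in enumerate(current):
        best_code, best_slope = 0, 0.0
        for (dr, dc), code in D8_CODES.items():
            row = rows[dr]
            c = col + dc
            if row is None or c < 0 or c >= len(row):
                continue  # neighbour falls outside the raster
            # Drop divided by distance: 1 for cardinal, sqrt(2) for diagonal
            slope = (z - row[c]) / math.hypot(dr, dc)
            if slope > best_slope:
                best_code, best_slope = code, slope
        out.append(best_code)
    return out

def d8_stream(rows):
    """Yield one row of direction codes at a time, holding only three input rows."""
    above, current = None, None
    for below in rows:
        if current is not None:
            yield d8_row(above, current, below)
        above, current = current, below
    if current is not None:
        yield d8_row(above, current, None)
```

Because `d8_stream` accepts any iterable of rows, the rows could come from a reader that loads one scanline at a time, and independent bands of rows (each with one row of overlap on either side) could be handed to separate processes.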