MakeFeatureLayer fails in multiprocessing Python

07-21-2014 03:42 AM
GeorgZweyer
New Contributor II

Hello everyone,

--- Update -------------------------------

See later posts for specific problem description.

I am trying to use Python multiprocessing to do some processing on chunks of a feature class. I adapted the third example from this Esri post: Multiprocessing with ArcGIS – Approaches and Considerations (Part 1) | ArcGIS Blog

If I set the pool size to 1 it runs correctly, but if I use more than one process, "MakeFeatureLayer" throws a "Cannot open file" or "Dataset does not exist or is not supported" error most of the time for every process but the first one. (I call "MakeFeatureLayer" on three different files in every process, and it reliably fails on at least one of them.)

It seems to be an error related to accessing the same file at the same time from different processes, because inserting a waiting time before the "MakeFeatureLayer" calls seems to work. Since this problem is not mentioned in the Esri post or anywhere else on the web, I think it must be my mistake, but I am doing the same thing as Esri in the third example of the post 😕 The only restriction mentioned in the blog post is that you cannot use multiprocessing to update data in a gdb, but "MakeFeatureLayer" needs only read access, doesn't it?

import arcpy
import time

# Input_Point_Features is a script tool parameter defined elsewhere
def worker(ranges):
    i = ranges[0]
    j = ranges[1]
    try:
        arcpy.AddMessage("started processing thread")
        arcpy.AddMessage(ranges)
        arcpy.CheckOutExtension("3D")
##        time.sleep(1*(ranges[2]-1))  # staggering the starts works around the error
        samples = arcpy.MakeFeatureLayer_management(
            Input_Point_Features,
            "point_layer{0}".format(i),
            "OBJECTID >= {0} AND OBJECTID <= {1}".format(i, j))
        # ... further processing of the layer ...
    except arcpy.ExecuteError:
        arcpy.AddError(arcpy.GetMessages(2))
        raise

Does anyone have any suggestions?

Thanks

14 Replies
JoshuaBixby
MVP Esteemed Contributor

At least for me, it is hard to provide much feedback given the limited code snippet.

Are you executing this code through the interactive Python window in ArcGIS Desktop?

GeorgZweyer
New Contributor II

Hello Joshua,

No, it runs as a script tool, out of process.

This is an edited version of the example from the Esri blog post. (My actual code is much more complex by now, but I can explain my problem better with this snippet.)

 

import multiprocessing
import numpy
import arcpy

def worker(ranges):
    i, j = ranges[0], ranges[1]
    lyr = arcpy.management.MakeFeatureLayer(
        Input_Point_Features,
        "point_layer{0}".format(i),
        "OBJECTID >= {0} AND OBJECTID <= {1}".format(i, j))
    result_table = calculatesomething(lyr)
    result_array = arcpy.da.TableToNumPyArray(result_table, ["*"])
    arcpy.management.Delete(result_table)
    return result_array
    # End worker function

def main():
    ranges = somegeneratedvalues
    pool = multiprocessing.Pool()
    result_arrays = pool.map(worker, ranges)
    # Concatenate the resulting arrays and create an output table.
    result_array = numpy.concatenate(result_arrays, axis=0)
    arcpy.da.NumPyArrayToTable(result_array, "mytargetlocation")
    # Synchronize the main process with the job processes to
    # ensure proper cleanup.
    pool.close()
    pool.join()
    # End main

if __name__ == '__main__':
    main()

The problem is:

MakeFeatureLayer fails if two or more processes try to run it on the same data at the same time. If I run it with two processes and insert a few seconds of waiting time for one of them, everything is fine. But every example I could find on the internet used MakeFeatureLayer and multiprocessing exactly like the code above, and none of them used any sort of mutual exclusion.

(I'm using multiprocessing.Lock right now to work around the problem, and it works most of the time. Still, it sometimes fails. I suspect the cause is ArcScene running and occasionally accessing the data at the same moment I try to use MakeFeatureLayer on it, but I'm not sure.)
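For reference, my locking looks roughly like this. This is only a simplified sketch: the Lock is shared through the pool initializer because it cannot be passed as a map argument, and Input_Point_Features is defined elsewhere as above.

import multiprocessing
import arcpy

lock = None

def init_worker(shared_lock):
    # Store the shared lock in each child process
    global lock
    lock = shared_lock

def worker(ranges):
    i, j = ranges[0], ranges[1]
    # Only one process at a time may open the source data
    with lock:
        lyr = arcpy.management.MakeFeatureLayer(
            Input_Point_Features,
            "point_layer{0}".format(i),
            "OBJECTID >= {0} AND OBJECTID <= {1}".format(i, j))
    # ... the actual processing happens outside the lock ...

def main():
    shared_lock = multiprocessing.Lock()
    pool = multiprocessing.Pool(initializer=init_worker,
                                initargs=(shared_lock,))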

JoshuaBixby
MVP Esteemed Contributor

What happens if you set maxtasksperchild=1 when instantiating the pool?

I realize this isn't a good idea operationally since it will kill performance, especially on Windows, but it will be interesting to see how it affects the errors.
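For clarity, that is only a change to the pool construction:

# Each worker process handles a single task and is then replaced
pool = multiprocessing.Pool(maxtasksperchild=1)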

JoshuaBixby
MVP Esteemed Contributor

Also, how big are the data sets you are working with?  Would it be possible for each child process to make an in-memory copy of the data set, or possibly a temporary copy on disk?

If there is some kind of file locking occurring, even though you aren't editing the data, maybe giving each child process its own copy of the data to work on would work around the issue.  Clunky, but you will still need a workaround if this is being caused by a bug.
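Something along these lines, as a rough sketch (the names are placeholders):

import os
import arcpy

def worker(ranges):
    # Give this child process its own in-memory copy of the source data
    local_copy = "in_memory/points_{0}".format(os.getpid())
    arcpy.management.CopyFeatures(Input_Point_Features, local_copy)
    # ... run the analysis against local_copy instead of the shared file ...
    arcpy.management.Delete(local_copy)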

ChrisMathers
Occasional Contributor III

The locking could be a cause, but if no edits are being made it shouldn't lock out reads. Locks are made at the process level, so multithreading is often fine where multiprocessing isn't, for formats such as shapefiles or FGDBs that have no multi-user editing functionality. Try using threads instead of processes and see what happens. I usually use threading for that reason: you only show up once in the database connection log, so you can't lock yourself out.
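One low-effort way to test this is multiprocessing.dummy, which wraps the threading module behind the same Pool API, so the rest of the code can stay as it is:

# Same Pool interface, but backed by threads in a single process
from multiprocessing.dummy import Pool as ThreadPool

pool = ThreadPool(4)  # 4 worker threads
result_arrays = pool.map(worker, ranges)
pool.close()
pool.join()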

JoshuaBixby
MVP Esteemed Contributor

I have generally had more success scaling performance with multiprocessing than with multithreading, but a lot depends on the extensions/packages being used and specific structure of the code.

I believe ArcPy requires CPython. Unless functions are specifically written to work around the GIL, which CPython has, multithreaded code won't always perform as people might expect. But again, it depends on the specifics. And if locking I/O is the problem and multithreading can address it, it beats multiprocessing code that doesn't run at all.

GeorgZweyer
New Contributor II

Hello Joshua and Chris,

first of all thanks for your answers.

What happens if you set maxtasksperchild=1 when instantiating the pool?

It happens while the workers are working on their first set of tasks, so this won't help.

Also, how big are the data sets you are working with?  Would it be possible for each child process to make an in-memory copy of the data set, or possibly a temporary copy on disk?

Well, the MakeFeatureLayer calls serve exactly this purpose. Calling functions that have to read the same data simultaneously failed in the past, so I came up with this. (That shouldn't have happened either, since all of those functions also only need read access to the data.)

I can try to make a copy "in_memory" rather than a layer.

Try using threads instead of processes and see what happens.

I could try this, but it would not be a solution to the problem, because I implemented multiprocessing to make use of multiple processor cores, and as far as I know multithreading doesn't offer that in the Python implementation ArcGIS uses.

One workaround I have in mind is to catch the exception and let the process retry the data access.
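Something like this rough sketch (the function name and parameters are made up):

import time
import arcpy

def make_layer_with_retry(in_features, out_layer, where, tries=3, wait=5):
    # Retry the layer creation a few times before giving up
    for attempt in range(tries):
        try:
            return arcpy.management.MakeFeatureLayer(in_features,
                                                     out_layer, where)
        except arcpy.ExecuteError:
            if attempt == tries - 1:
                raise
            time.sleep(wait)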

Also, I ran the script as a standalone Python script (with ArcGIS completely closed) for an hour without an error, which supports my theory that ArcScene occasionally accesses the data in the same moment I try to use MakeFeatureLayer on it.


Is there a chance that someone from Esri stumbles upon this thread and can comment on the expected behavior of the "MakeFeatureLayer" function, or do I have to contact tech support directly?

JoshuaBixby
MVP Esteemed Contributor

If you think ArcScene is actually causing the problem, in terms of locking the file, I wonder what would happen if you went into Windows Explorer and made the files in question read-only.  It could be it makes everything worse, or it might prevent ArcScene from exclusively accessing the file when other programs are trying to read it.  Simple enough to try.

GeorgZweyer
New Contributor II

Hello again,

thanks for the participation so far. Sadly, my problem persists. I still think that ArcScene is causing it somehow, but no longer that simultaneous file access from ArcScene is the cause. I inserted a retry mechanism that tries the data access two more times after 10-second breaks, but the access fails every time. I looked into the folder of the gdb and could not find any abnormal lock files there: one sr.lock for every data source and every process, and one rd.lock whenever data is actually read.

Furthermore, it looks like the problem only occurs if the script is started from within ArcScene, not if ArcScene is merely running as I thought before, though I am not sure about that because there is no way to reliably reproduce the error. Starting the script from a command-line window seems to work correctly.

I also noticed that in addition to the "Cannot open file" and "Dataset does not exist or is not supported" errors, sometimes an "Error executing function" error is thrown.

One more point: running the script from ArcScene with only one worker process seems to work correctly, though again this is only an assumption. It is also not the case that there is always one worker process that runs fine: in one test with 4 worker processes, 2 of them crashed while handling their first task and the other 2 during their second.

I am pretty frustrated right now and running out of ideas. The only solution for me at this point seems to be to run the script only from the command line.

As I am writing this, it occurs to me that I am using layers as input when I run the script from inside ArcScene, which is not the case when I run it from the command line, so maybe there is a problem with using "MakeFeatureLayer" on a layer?

Sorry for any language errors.
