Parallel Processing Travel Cost Matrices

01-24-2014 12:49 PM
AndreasSantucci
New Contributor
Dear ArcGIS Community,

As part of my research, I am creating travel cost matrices for ~30k origin locations and ~3k destination locations. Calculating travel cost matrices for such a large number of origins and destinations is quite time-consuming, and currently takes several days to compute. My current workflow is as follows:

- Start with two ".csv" files, containing the latitude and longitude coordinates for origins and destinations.
- Using Python, I create ".lyr" files which contain a street map, origins, destinations, and relevant options.
- Using a ".bat" script, I call the executable file "GenerateCSVMatrixFromOD.exe" and apply it to each ".lyr" file created in the preceding step.

Because there are so many origins and destinations, I am forced to split my data into chunks before processing. Currently, I split my data into 6 chunks, where each chunk contains all ~30k origin locations and approximately 500 destination locations. Executing the process above for all 6 ".lyr" files takes a total of about 3 days of computing time. In the near future, I wish to do another experiment which involves calculating, for the same set of ~30k origin locations, the travel cost to ~40k unique destinations. At the current rate, this process will take weeks.

My main question is this: is it possible to speed up this process? I could envision opening multiple ".bat" scripts to process each chunk simultaneously, but I'm not sure how ArcGIS would handle having multiple instances running at the same time, whether there would be issues with sharing the same geodatabase, or whether the instances would contend for the same physical CPU resources. I could also envision partitioning my data in a different, and possibly more efficient, way.

What's the best way to speed up this process?

I am happy to provide any additional details that would help. Thank you for your time.
5 Replies
RamB
Occasional Contributor III
If you are good in Python, why would you not want to stay within Python (outside ArcGIS)?

Please see this

Regards,
MelindaMorang
Esri Regular Contributor
Hi Andreas.

There are a couple of ways to speed up lengthy network calculations:

1) Install the 64-bit Background Geoprocessing product.  This way, when you run ArcToolbox tools in Desktop, if you have background geoprocessing enabled (in Geoprocessing->Options), the tool will run in 64-bit mode, which speeds things up quite a bit.  Note that you have to be running the ArcToolbox version of Solve instead of clicking the little Solve button on the Network Analyst toolbar.  You can also run python script tools using 64-bit background GP by simply calling the 64-bit version of python.  The documentation explains it in more detail: http://resources.arcgis.com/en/help/main/10.2/index.html#//002100000040000000.
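
For example, here is a minimal sketch of launching a script with the 64-bit interpreter (the install path is an assumption based on a default ArcGIS 10.2 setup, and the script name is a placeholder; adjust both to your version and install location):

import subprocess

# Assumed default install path for the 64-bit interpreter at 10.2;
# yours may differ depending on version and install location.
py64 = r"C:\Python27\ArcGISx6410.2\python.exe"
# "generate_od_matrices.py" is a hypothetical script name.
subprocess.call([py64, r"C:\Scripts\generate_od_matrices.py"])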

2) You can use python's multiprocessing module to run several analyses in parallel.  This is extremely helpful for a problem like yours that can be broken up into chunks.  I have some examples I can share if you're interested.  Setting up the multiprocessing is simple in principle, but it doesn't always play well with ArcGIS, so it's a bit tricky.  You can't write to the same output geodatabase from multiple processes at once (it's a schema lock issue), and you can't share the same NA layer across processes.  It works best to split up the problem into chunks first, and then your multiprocessed function can create a new layer, add the locations, solve it, and write the results.  Then, back in your main script, do something to combine the results in the way you want.
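
For example, one way around the schema lock issue is to give each process its own scratch workspace, something like this (the folder layout and naming here are just placeholders):

import os
import arcpy

def MakeScratchGDB(scratchFolder, idx):
    '''Create a separate file geodatabase for worker number idx,
    so no two processes ever share a schema lock.'''
    gdbName = "worker_%d.gdb" % idx
    gdbPath = os.path.join(scratchFolder, gdbName)
    if not arcpy.Exists(gdbPath):
        arcpy.CreateFileGDB_management(scratchFolder, gdbName)
    return gdbPath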

Hope this helps.
AndreasSantucci
New Contributor
Hi mmorang,

Thank you for your suggestions. The 64-bit background geoprocessing product seems like it will help a lot, as does the parallel processing. It looks like I might be out of luck for the 64-bit processing unless I upgrade my license, though; I'm currently on ArcGIS 10.0.

I understand there are a few caveats (using the right solve function, and setting up an embarrassingly parallel process), and with this in mind it would be extremely helpful to see some code examples.

Right now, I am creating ".lyr" files in Python, and then using a ".bat" script to feed these to the GenerateCSVMatrixFromOD executable, available here: http://www.arcgis.com/home/item.html?id=00a6a60a32b54393996f0a8ce9ee02c9

However, it seems that maybe this process can be wholly contained within python?

Thank you so much for your time!
MelindaMorang
Esri Regular Contributor
Hi Andreas.  Yes, it looks like that executable just does some stuff you could write yourself in python without too much trouble.  You just have to grab the output OD Lines, run a search cursor, and print each row to the CSV file using python's simple csv module.  Or, you can just continue using that executable.  I think it would still work with multiprocessing.
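
For example, a bare-bones version of that save-to-CSV step might look like this (the field names are placeholders, so check your OD Lines sublayer for the real ones; also note that arcpy.da.SearchCursor needs 10.1 or later, so on 10.0 you would use the older arcpy.SearchCursor instead):

import csv
import arcpy

def ODLinesToCSV(linesSublayer, outCSV):
    '''Write the solved OD Lines rows out to a CSV file.'''
    # Placeholder field names; your cost field depends on your impedance attribute.
    fields = ["Name", "Total_Time"]
    with open(outCSV, "wb") as f:  # binary mode keeps Python 2's csv module from writing blank rows
        writer = csv.writer(f)
        writer.writerow(fields)
        # Requires ArcGIS 10.1+; on 10.0, use arcpy.SearchCursor instead.
        for row in arcpy.da.SearchCursor(linesSublayer, fields):
            writer.writerow(row)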

Also, before you continue, please note that multiprocessing isn't going to help you much if your machine has only 2 cores or something, so think about that before spending a lot of time on this.

Here's a little outline script showing how you can set up multiprocessing.  Let me know if this isn't clear or you need more detail.

import multiprocessing
import csv
import arcpy

def CalculateOD(stuffPassed):
    '''Calculate the OD matrix for one chunk and write the results to a CSV'''
    chunk, idx = stuffPassed[0], stuffPassed[1]  # unpack what main() passed in

    ## Generate the input layer. You can read in a saved layer template
    ## and add the appropriate origins and destinations for this chunk

    ## Solve the layer

    ## Call the external Save-to-CSV code or write your own code

    ## Return the path to the CSV, e.g.:
    # return outCSVPath


# ---- Main code ----
def main():

    ## Do whatever you need to prepare your analysis
    ODChunks = []
    ## Fill your list of OD chunks somehow

    # Pass this info as inputs to the multiprocessing function. The chunk
    # index is included because you can use it to name the output file.
    stuffToPass = []
    for idx, chunk in enumerate(ODChunks):
        if chunk:
            stuffToPass.append([chunk, idx])  # plus any other variables CalculateOD needs

    # Do the multiprocessing. It returns the paths to the output csvs.
    pool = multiprocessing.Pool()
    OutCSVs = pool.map(CalculateOD, stuffToPass)
    pool.close()
    pool.join()

    # Combine the output csvs into one giant table
    header = None
    allRows = []
    for CSVFile in OutCSVs:
        with open(CSVFile, "rb") as f:  # binary mode for Python 2's csv module
            reader = csv.reader(f)
            fileHeader = next(reader)
            if header is None:
                header = fileHeader  # keep the field headers only once
            allRows.extend(reader)
    ## Write header + allRows wherever you want them


if __name__ == '__main__':
    main()


Note: You can't run multiprocessed code like this from within a python IDE.  You have to run it from a command line.  Also, you have to have the if __name__ == '__main__': guard around the main() function.
JaySandhu
Esri Regular Contributor
I would suggest that instead of splitting based on destinations, you split the data based on origins. So do 1K origins by 3K destinations and repeat 30 times, instead of 30K origins by 500 destinations. The reason is that the OD solver solves 1-by-N as a single solve, so it is more efficient to solve from each origin to all destinations at once.
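
For example, the chunking itself is simple (a rough sketch; the 1K chunk size is just a starting point to tune):

def ChunkOrigins(origins, chunkSize=1000):
    '''Yield successive chunks of the origins list.'''
    for i in range(0, len(origins), chunkSize):
        yield origins[i:i + chunkSize]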

Also, what is the distribution of the data? That is, are all the locations in a single city, say within 20 or 30 miles, or are they spread out more? Is hierarchy on or off? If the data is confined to a small area, say 30 miles across, then turning off hierarchy and placing a cutoff of, say, 50 miles will speed up the processing.
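
For example, both settings are parameters on the OD cost matrix layer when you create it in Python (a rough sketch; the network dataset path and impedance attribute are placeholders to replace with your own):

import arcpy

arcpy.CheckOutExtension("Network")

# Placeholder network dataset and impedance attribute; substitute your own.
arcpy.MakeODCostMatrixLayer_na(
    r"C:\Data\Streets.gdb\Transportation\Streets_ND",
    "ODMatrix",
    "Miles",
    default_cutoff=50,           # ignore destinations beyond 50 miles
    hierarchy="NO_HIERARCHY")    # hierarchy off for a small, dense area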

Jay Sandhu