Script running slow, how to speed up?

06-03-2015 12:54 PM
LauraBlackburn
New Contributor III

I have a script that creates a buffer around points, converts the round buffers to squares, clips a roads layer with each square, then calculates the length of the clipped roads and exports the data to a CSV file.  I have run this script before and it took roughly 3 days to complete.  This time when I ran the code it's projected to take a month to mine the data.  I have no idea why it might be slowing down and could really use some help figuring this out.  Does anyone know how I could speed up this process?

# -*- coding: utf-8 -*-
# ---------------------------------------------------------------------------
# 2_TrapStatsAndFloatingRoadDensity.py
# Altered on: 2015-6-2 10:00 am
# Created by: L.M.Blackburn
# ArcGIS 10.1
# Notes: iterates thru orders shapefiles and creates a buffer
#        around each point. The round buffer is converted to a feature envelope
#        which makes a square polygon around the point. This square is used to
#        clip the roads layer. Then I calculate length for the clipped roads by adding
#        a field and calculating geometry. Run statistics to get the sum length
#        of roads in each square. export sum length, block name, centroid of
#        square out to a text file
#        (2nd script for 4th part of analysis)
# Needed data: Order sublayer from VRP
#              List of fields to drop
#              Roads network to get length data               
# ---------------------------------------------------------------------------

# Import arcpy module
import arcpy, os, sys, traceback, csv

# Set the necessary product code
import arceditor
import arcinfo

# Set workspace environment - this is where the solved route layers are located
arcpy.env.workspace = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\Orders\\'
arcpy.env.overwriteOutput = True

# --------------------Set local variables ------------------
# ---workspace for order layers
ordersLyr = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\Orders\\'
# ---workspace for buffer layers
bufferLyr = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\Buffers\\'
# ---set default buffer distance
bufDefault = "6000 Meters"
# ---workspace for envelope layers
envelopeLyr = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\Envelopes\\'
# ---workspace for route layers
routeWkspc = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\Routes\\'
# ---set workspace path for clipped roads
outDataPath = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\ClipRoads\\'
# ---set road layer - for use in clip & length
Road = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\RoadsForSnap\\SDC Edge Source.shp'
# ---fields to delete from Orders layers
dropFields = ["Descriptio", "ServiceTim", "TimeWind_1", "TimeWind_2", "TimeWind_3", "TimeWindow", "MaxViolati", "MaxViola_1", \
              "PickupQuan", "DeliveryQu", "Revenue", "SpecialtyN", "Assignment", "RouteName", "ViolatedCo", "CumulTrave", \
              "CumulDista", "CumulTime", "ArriveCurb", "DepartCurb", "DepartTime", "WaitTime", "ViolationT", "CumulWaitT", \
              "CumulViola", "ArriveTime", "SourceID", "SourceOID", "PosAlong", "SideOfEdge", "CurbApproa", "Status"]
              
try:
    print 'Opening CSV file...'
    # Step 1 - Create table for output data
    # Needed fields: From script [Block, Scale, Sequence, FromPrevTr, FromPrevDi, totTime, totDist, x, y, totalRdLength]
    CSVFile = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\AllTrapsData_June2.csv'
    f = open(CSVFile, "wb")
    w = csv.writer(f, delimiter=',', lineterminator='\n')
    fieldNames = ['Block', 'Scale', 'BufferType', 'Sequence', 'FromPrevTr', 'FromPrevDi', 'x', 'y', 'bufferDist', 'totTime', \
                  'totDist', 'totalRdLength']
    w.writerow(fieldNames)

    # Step 2 - Loop through files in orders folder
    # Create variable to hold orders feature class files
    '''
    orderList = arcpy.ListFiles("*.shp")

    # Loop through layers & break name into segments to be used when naming orders lyrs
    print 'Creating Buffers...' 
    for shp in orderList:
        lyrs = arcpy.mapping.Layer(ordersLyr + shp)
        splitName =shp.split('_')
        block = splitName[0]
        scale = splitName[1].rstrip('.shp')

        # define buffer distance 
        if scale == "1K":
            bufDist = "1000 Meters"
        elif scale == "2K":
            bufDist = "2000 Meters"
        elif scale == "3K":
            bufDist = "3000 Meters"
        elif scale == "4K":
            bufDist = "4000 Meters"
        elif scale == "6K":
            bufDist = "6000 Meters"
        elif scale == "500":
            bufDist = "500 Meters"
        else:
            bufDist = bufDefault
   
        print 'Order: ' + shp + ', Buffer: ' + bufDist
        # Jump into orderLayers to delete unneeded fields & add XY
        arcpy.DeleteField_management(shp, dropFields)
        arcpy.AddXY_management(shp)
                    
        # Step 4 - Run buffer on order layers     
        arcpy.Buffer_analysis(shp, bufferLyr + 'buf' + '_' + block + '_' + scale, bufDist, 'FULL', 'ROUND', 'NONE', '#')
        arcpy.Buffer_analysis(shp, bufferLyr + 'buf2' + '_' + block + '_' + scale, bufDefault, 'FULL', 'ROUND', 'NONE', '#')
    
    bufferList = []
    for dpath, dnames, fnames in arcpy.da.Walk(bufferLyr, datatype = 'FeatureClass', type = 'Polygon'):
        for files in fnames:
            bufferList.append(os.path.join(files))
    
    # Step 5 - Run feature envelope to polygon on buffer layers
    print 'Converting round buffers to squares...' 
    for bufShp in bufferList:
        bufSplitName = bufShp.split('_')
        blockID = bufSplitName[1]
        scaleID = bufSplitName[2].rstrip('.shp')
        bufID = bufSplitName[0]
        print 'Buffer: ' + bufShp + ' Block: ' + blockID + ' Scale: ' + scaleID + ' BufferType: ' + bufID

        arcpy.FeatureEnvelopeToPolygon_management(bufferLyr + bufShp, envelopeLyr + bufID + '_' +  blockID + '_' +  scaleID, 'SINGLEPART')
    '''
    # Step 6 - Calculate totalTime and totalDistance using insert cursor
    # loop through each record in the table - calculate values and...
    # use selected features to clip roads layer & calculate geometry
    # add values to the CSV table
    print 'Populating CSV file...'
    envelopeList = []
    for dirpath, dirnames, filenames in arcpy.da.Walk(envelopeLyr, datatype = 'FeatureClass', type = 'Polygon'):
        for filename in filenames:
            envelopeList.append(os.path.join(filename))
    for eLayer in envelopeList:
        eLayerName = eLayer.split('_')
        blockEID = eLayerName[1]
        scaleEID = eLayerName[2].rstrip('.shp')
        bufEID = eLayerName[0]
        print 'eLayer: ' + eLayer + ' Block: ' + blockEID + ' Scale: ' + scaleEID + ' BufferType: ' + bufEID
        eLyrPath = envelopeLyr + eLayer
        
        #Make a layer from the feature class - needed for selecting records to be used in the clipping
        arcpy.MakeFeatureLayer_management(eLyrPath, "clipLayer")
        
        # use search cursor to grab data row by row to populate CSV file
        with arcpy.da.SearchCursor(eLyrPath, ("Sequence", "FromPrevTr", "FromPrevDi", "POINT_X", "POINT_Y",
                                                          "BUFF_DIST")) as cursor:
            for row in cursor:
                dataList = []
                dataList.append(str(blockEID))
                dataList.append(str(scaleEID))
                dataList.append(str(bufEID))
                dataList.append(str(row[0]))
                dataList.append(str(row[1]))                
                dataList.append(str(row[2]))
                dataList.append(str(row[3]))
                dataList.append(str(row[4]))
                dataList.append(str(row[5]))
                n = int(row[0])
                sN = "=" + str(n)
                n2 = n +1
                sel = "=" + str(n2)
                prevTime = float(row[1])
                prevDist = float(row[2])
                # print sN
                postTime = ""
                # use a nested search to get values for total time and total distance
                # there are some cases where the expression below will be false; in that
                # event we need to calculate the distance to the depot or just leave the
                # values for prevTime & prevDist as placeholders, otherwise road length
                # ends up being written in the wrong field
                expression = arcpy.AddFieldDelimiters(eLyrPath, "Sequence")+ sel
                for row2 in arcpy.da.SearchCursor(eLyrPath, ("FromPrevTr", "FromPrevDi"), expression):
                    postTime = row2[0]                    
                    totTime = prevTime + float(postTime)
                    # print totTime
                    dataList.append(str(totTime))
                    postDist = row2[1]
                    # print postDist
                    totDist = prevDist + float(postDist)
                    # print totDist
                    dataList.append(str(totDist))

                # if the above for loop yields no results, calculate postTime and postDist differently
                # grab DepotVisits layer - if VisitType = "End" then grab values for "FromPrevTravelTime" & "FromPrevDistance"
                if postTime == "":  
                    
                    #select corresponding route layer file and grab sublayer: DepotVisits
                    print "extracting time & distance to Depot"
                    routeLyr = arcpy.mapping.Layer(routeWkspc + "TrapRoute_" + blockEID + "_" + scaleEID + ".lyr")
                    if routeLyr.isGroupLayer:
                        for sublyrs in routeLyr:
                            # print sublyrs.name
                            if sublyrs.name == 'Depot Visits':
                                arcpy.MakeFeatureLayer_management(sublyrs, "depotVisits")
                                qLast = "= 2"
                                expression2 = arcpy.AddFieldDelimiters("depotVisits", "VisitType")+ qLast
                                for row3 in arcpy.da.SearchCursor("depotVisits", ("FromPrevTravelTime", "FromPrevDistance"), expression2):
                                    postTime2 = row3[0]
                                    totTime2 = prevTime + float(postTime2)
                                    # print ("Depot Visit Time: {0}, {1}, {2}".format(prevTime, postTime2, totTime2))
                                    dataList.append(str(totTime2))
                                    postDist2 = row3[1]
                                    totDist2 = prevDist + float(postDist2)
                                    # print ("Depot Visit Distance: {0}, {1}, {2}".format(prevDist, postDist2, totDist2))
                                    dataList.append(str(totDist2))
               

                # select single record to use when clipping - this selection must be on a layer not fc
                # always create where clause using AddFieldDelimiters
                seqField = "Sequence"
                where_clause = arcpy.AddFieldDelimiters("clipLayer", seqField)+ sN
                arcpy.SelectLayerByAttribute_management("clipLayer", "NEW_SELECTION", where_clause)
                
                # Clip roads w/ selected polygon - for this I will need a search cursor
                clipRoads = outDataPath + 'rdsClip_' + blockEID + '_' + scaleEID + '_' + bufEID + '_' + str(n)
                arcpy.Clip_analysis(Road, "clipLayer", clipRoads)
                clipRoadsShp = clipRoads + ".shp"        

                # Use geometry/length to get the total length of clipped roads(in meters)
                g = arcpy.Geometry()
                geometryList = arcpy.CopyFeatures_management(clipRoadsShp, g)
                length = 0
                for geometry in geometryList:
                    length +=geometry.length

                # append length (meters) at end of line to csv dataList
                dataList.append(str(length))
                # dataList.append('\r\n')
                # print length 
                
                # Write dataList to csv file ---- may need to dedent this one more time ---
                w.writerow(dataList)
                # print dataList
        
    print 'Script complete'

    f.close()
except:
    if arcpy.Exists(CSVFile):
        w.writerow(dataList)
        print 'Data written in CSV file'
    f.close()
    print 'Program failed.'
    print 'Check python errors.'
    
    
    tb = sys.exc_info()[2]
    tbinfo = traceback.format_tb(tb)[0]
    pymsg = "PYTHON ERRORS:\nTraceback Info:\n" + tbinfo + "\nError Info:\n     " +        str(sys.exc_type) + ": " + str(sys.exc_value) + "\n"
    msgs = "ARCPY ERRORS:\n" + arcpy.GetMessages(2) + "\n"
    
    arcpy.AddError(msgs)
    arcpy.AddError(pymsg)

    print msgs
    print pymsg
    
    arcpy.AddMessage(arcpy.GetMessages(1))
    print arcpy.GetMessages(1)
    # print ("Points to select: {0}, Radius: {1}".format(centroidPath, radius))
35 Replies
TonyAlmeida
Occasional Contributor II

Are you using 64-bit Python?

Have you tried using the in_memory workspace?
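
For illustration, here is a rough sketch of what that could look like in this script, assuming the "clipLayer" feature layer and Road path from the code above (the in_memory output name is made up):

# Hypothetical sketch only: write each clip result to the in_memory workspace
# instead of a shapefile on disk, then delete it before the next loop pass.
import arcpy

Road = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\RoadsForSnap\\SDC Edge Source.shp'

clipRoads = 'in_memory\\rdsClipTemp'          # lives in RAM, no .shp written to disk
arcpy.Clip_analysis(Road, "clipLayer", clipRoads)

# ... measure the total road length from the in-memory feature class here ...

arcpy.Delete_management(clipRoads)            # free the memory before the next clip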

LauraBlackburn
New Contributor III

Hi Tony,

I am using IDLE on 64-bit Windows 7.  I am new to Python, so I don't know how to use the in_memory workspace.  Can you walk me through it?

I just find it odd that it went so much faster a few months ago and now it's super slow.

BlakeTerhune
MVP Regular Contributor

Agreed, in_memory workspace can improve performance. How many points and lines do you have?

Although it probably won't improve performance, opening your csv file using a with statement (like you do with the arcpy cursors) will automatically close the file if there is an error so you don't have to worry about adding f.close() everywhere.
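
For example, the file handling at the top of your script could be rewritten along these lines (same CSV path and writer settings as in your code; just a sketch):

import csv

CSVFile = 'F:\\Workspace\\Sandy\\GM_costAnalysis\\analysisMay15cont\\AllTrapsData_June2.csv'

# the with statement closes the file automatically, even if an exception is raised
with open(CSVFile, "wb") as f:
    w = csv.writer(f, delimiter=',', lineterminator='\n')
    w.writerow(['Block', 'Scale', 'BufferType', 'Sequence', 'FromPrevTr', 'FromPrevDi',
                'x', 'y', 'bufferDist', 'totTime', 'totDist', 'totalRdLength'])
    # ... everything that calls w.writerow(dataList) goes inside this block ...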

LauraBlackburn
New Contributor III

This was all part of a network analysis.  I had created 30 blocks that are 12 x 12 km and placed points inside those blocks at 6 different scales - so for the 6K scale there are 4 points, for the 4K scale there are 9 points, for the 3K scale there are 16 points, for the 2K scale there are 36 points, for the 1K scale there are 144 points, and for the 0.5K scale there are 576 points.  I have no idea how many line segments - I am using a roads network for that and clipping at two scales around each point to determine the average road density around the point.  So the road density is computed around the point within a 6000-meter radius and within a radius equal to the point spacing (either 500 m, 1000 m, 2000 m, 3000 m, 4000 m, or 6000 m).

Anyway, I can't quite grasp why it would be so slow this time w/ the same code I used before.  It doesn't really seem to be taxing my system much when I look in the task manager - CPU usage 17% and physical memory 37%.

Thanks for the link Blake - I'll check into that.

IanMurray
Frequent Contributor

Nesting cursors is generally accepted to be a very slow way to manage and manipulate data between two sources.  Two levels can be fairly slow, so three nested cursors would be even worse.  Richard Fairhurst has made some good posts about using dictionaries with cursors to speed up processes such as yours.  The original thread where he addressed it is Updating the rows in one table based on values in another table or using nested cursors, and his blog post on using dictionaries and cursors is Turbo Charging Data Manipulation with Python Cursors and Dictionaries.

I didn't dig into your code too much, but I've seen many cases where people mentioned nested cursors being rather slow, and if processing the data is taking days, there must be a better way.  Perhaps Richard will have some suggestions for you.

LauraBlackburn
New Contributor III

Ah, I knew I could count on this forum to find a solution to my problem.  I'll look into both of these options in the morning.  I knew the nested cursors were messy but couldn't find a workaround, so thank you, Ian, for leading me to Richard Fairhurst's information.

Laura

RichardFairhurst
MVP Honored Contributor

Laura;

In general the inner cursors should come first and be converted to a new dictionary for each Search cursor.  An Update Cursor follows in a completely separate loop after the first loops finish.  The first loop Search cursor(s) would read the entire table into the dictionary all at once.  Each Search cursor is processed in a separate loop prior to doing any data updates.  Alternatively, you can include Python logic while every record is being read to only load the records that make sense into the dictionary for matching to the second separate update cursor loop. The Python logic would somewhat take the place of the iterative SQL expressions.

The dictionary key set up with the SearchCursor has to make sense for the match to the other table.  Two or more fields can be loaded as a tuple key (not a list).  Using the Python tuple() and list() functions, the key can be converted back and forth between a dictionary key and a list for iterating over and modifying the data.  A dictionary key can point to a value that is a list of lists to deal with one-to-many or many-to-many relationships.

You may want to look at the code in this thread for my Multiple Field Key to Single Field Key tool to see an example of some more advanced options that expand on the techniques outlined in the blog.  The tool itself may actually be able to create a single key that takes the place of the inner-loop record matching, if the purpose of the inner loop is to process a multiple-field relationship.  Once the multiple-field key is converted to a single-field key, you can use more traditional join techniques and the Field Calculator to do data transfers.  I have been using the tool to do multiple-field matches on two record sets containing 100K and 800K records, and it normally finishes in about 1 to 2 minutes.
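
As a rough sketch of that pattern (the table and field names here are only placeholders, not taken from your script):

# Hypothetical sketch: read the source table into a dictionary once, keyed on a
# tuple of fields, then match against it in a single, separate loop.
import arcpy

matchDict = {}
with arcpy.da.SearchCursor("sourceTable", ("Block", "Scale", "TravelTime", "Distance")) as cursor:
    for row in cursor:
        key = (row[0], row[1])                                    # tuple key, not a list
        matchDict.setdefault(key, []).append([row[2], row[3]])    # list of lists for 1:M matches

with arcpy.da.SearchCursor("targetTable", ("Block", "Scale")) as cursor:
    for row in cursor:
        key = (row[0], row[1])
        if key in matchDict:
            for travelTime, distance in matchDict[key]:
                print travelTime, distance                        # use the matched values here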

LauraBlackburn
New Contributor III

Thanks Richard for your input. It's so great to have such fast and insightful feedback from this forum!

So what's my best bet for increasing processing speed of this script?  Do I just write to the in-memory workspace and leave the nested cursors?  Or, should I incorporate both the in-memory workspace and the dictionary keys?  Which of these changes will have the most significant impact?

RichardFairhurst
MVP Honored Contributor

The nested cursors are, without a doubt, the source of the problem.  In-memory processing with dictionaries instead of iterative SQL statements will improve the speed 100- to 1000-fold for a two-level nested cursor.  The slow speed is due to all of the queries you are doing, especially if the query fields are not indexed.  The dictionary method performs extremely fast regardless of whether the fields are indexed or not.  I pursued my solution for years to avoid the speed bottleneck you are experiencing with nested cursors, and I have no doubt that avoiding SQL and using in-memory dictionaries for all record matching will make this process complete in a matter of minutes.

Dictionaries do random-access matching, while SQL queries scan linearly through the entire table, so even against an in-memory table SQL will never perform as fast as a dictionary.  With the nested-cursor approach the problem grows rapidly worse as the number of records in the two (or three) sources increases, whereas the dictionary approach is hardly affected even by doubling the records in all of the data sources.

Read this article to discover why dictionaries are so much better than linear processing collections (which cursors and SQL operations on tables basically are) for lookup/record matching processes.  Dictionaries are fast because: "The insert/delete/lookup time of items in the dictionary is amortized constant time - O(1) - which means no matter how big the dictionary gets, the time it takes to find something remains relatively constant."
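
Applied to the script in the question, a sketch of the idea might look like this (it reuses the eLyrPath variable and field names from the original code and leaves out the DepotVisits fallback):

# Hypothetical sketch: build the Sequence lookup once per envelope layer, then
# replace the inner SearchCursor + per-row SQL expression with a dictionary lookup.
import arcpy

seqDict = {}
with arcpy.da.SearchCursor(eLyrPath, ("Sequence", "FromPrevTr", "FromPrevDi")) as cursor:
    for row in cursor:
        seqDict[int(row[0])] = (float(row[1]), float(row[2]))

with arcpy.da.SearchCursor(eLyrPath, ("Sequence", "FromPrevTr", "FromPrevDi",
                                      "POINT_X", "POINT_Y", "BUFF_DIST")) as cursor:
    for row in cursor:
        n = int(row[0])
        if n + 1 in seqDict:                      # O(1) lookup instead of a query per record
            postTime, postDist = seqDict[n + 1]
            totTime = float(row[1]) + postTime
            totDist = float(row[2]) + postDist
        # else: fall back to the time/distance to the depot, as in the original script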