Select to view content in your preferred language

Turbo Charging Data Manipulation with Python Cursors and Dictionaries

43317
59
11-08-2014 02:16 AM
RichardFairhurst
MVP Honored Contributor
47 59 43.3K

The Problems with Using the Field Calculator, Join and Summary Statistics tools

 

The field calculator is a great tool and easy to set up, but it has several limitations that can seriously slow down scripted routines when you need to do a lot of data manipulation with large data sets.  The most obvious limitation is that only one field can be calculated at a time.  This often means having to process every record in the feature class multiple times in order to update several fields.  If you need to do data transfers between feature classes using a join it is critical to index the join field in both feature classes to avoid calculations that take forever to process, and even then the speed is significantly slower when a join is involved.  Also, if you actually need to use multiple fields to create a join that correctly matches records between feature classes your only option is to create a new field in both feature classes that concatenates those field values to make a join that works.  Also, often before these joins can be created a summary table has to be created, which takes more time.

 

Fortunately if you have ArcGIS Desktop 10.1 or later there is an alternative that can speed scripts like this up dramatically and make data transfers between feature classes both flexible and incredibly fast.  The solution is to replace the field calculator with one or more data access cursors (da cursors) and joins and creating summary tables with one or more python dictionaries built using a da SearchCursor.

 

Using a da UpdateCursor to Replace a Field Calculation within a Single Feature Class:

 

Below is a simple script that will use a da UpdateCursor on a feature class to fill in a URL field with a hyperlink that is built using a value taken from another field.  The strftime method is used to print out the time when the script starts and the time when it finishes.  To adapt this script to your own data you only need to change the Feature Class path and name in line 7, the field names in the field list in line 9 and the URL string in line 14.  This should perform about 7 times faster than a similar field calculation.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

updateFC = r"C:\Path\FeatureClass"

updateFieldsList =  ["URLField", "ValueField"] # Example fields that will build a URL based on another field

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # updateRow[0] is the URL field and updateRow[1] is the value field

        updateRow[0] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(updateRow[1])

        updateRows.updateRow(updateRow)

print( "Finished APN_MAP field update: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a da UpdateCursor to Replace Several Field Calculations within a Single Feature Class:

 

Now lets say that we actually need to process 2 more field calculations to complete the update of this feature class.  One calculation needs to add two other fields together, and the other needs to convert the values in a field from capitalized values to title case values.  So if there are 100K records, then all of those records have to be calculated 3 times.

 

All of this can be handled in a single pass of one da UpdateCursor through the 100K records.  Because the records only have to changed once and the addition of more fields for the UpdateCursor to read or calculate takes very little additional time to process, the speed gain over 3 field calculations is much faster that the previous script.  As a result, the more field calculations you can replace with a single pass of a da UpdateCursor, the more dramatically the speed of the script will improve.

 

Below is the script that can do all three field updates (assume that there are no Null values in the fields being added together or changing case).  To use this script with your own data you still only need to change lines 7, 9, and 14 (optionally change the calculations in lines 16 and 18).

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

updateFC = r"C:\Path\FeatureClass"

updateFieldsList =  ["URLField", "ValueField", "SumField", "NumberField1", "NumberField2", "TitleCaseField"] # Example fields that will build a URL based on another field

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # updateRow[0] is the URL field and updateRow[1] is the value field

        updateRow[0] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(updateRow[1])


         # updateRow[2] is the Sumfield and updateRow[3] and updateRow[4] are the NumberField1 and NumberField2 fields. 
        updateRow[2] = updateRow[3] + updateRow[4]
         # updateRow[5] is the TitleCaseField that is having its Capitalized values replaced with Title Case values.
        updateRow[
5
] = updateRow[
5
].title()


        updateRows.updateRow(updateRow)


print( "Finished APN_MAP field update: " + strftime("%Y-%m-%d %H:%M:%S"))
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a Python Dictionary Built using a da SearchCursor to Replace a Join Connecting Two Feature Classes:

 

Another common data manipulation involves joining one feature class to another in order to transfer values from the Joined feature class to the Target feature class using the Field Calculator.  However, features loaded into a Python Dictionary using a Search Cursor are much faster at matching join field values in one feature class to another than a Join.  Combined with using a da UpdateCursor to replace the Field Calculator, the speed of these kinds of data manipulations can be even more dramatic than data manipulations on a single feature class.

 

Example 1 - Transfer of a Single Field Value between Feature Classes

 

Below is the code required to transfer the value of one feature class to another based on a common join value in another field.  To adapt this script to your own data change lines 7 and 14 to the path and name of the source (Join) and update (target) Feature Class to match the feature classes you would normally join, change lines 9 and 16 to replace the field list with the name of the Join field and the name of the field being transferred.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField"]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "ValueField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable

        keyValue = updateRow[0]

         # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
             # transfer the value stored under the keyValue from the dictionary to the updated field.
            updateRow[1] = valueDict[keyValue][0]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))


‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Example 2 - Transfer of Multiple Field Values between Feature Classes where there is a 1:1 Match between Field Sets

 

Just like the Field Calculator the more fields being transferred from one feature class to the other, the more your script will speed up.  If the transfer is a simple field for field transfer then the transfer fields can be handled in a for loop.  Below is an example of a transfer of 5 matching fields between two feature classes.  To adapt this code to your data you should again change the feature classes in lines 7 and 14 and the field lists in lines 9 and 16.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField1", "ValueField2", "ValueField3", "ValueField4", "ValueField5" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "ValueField1", "ValueField2", "ValueField3", "ValueField4", "ValueField5"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # transfer the values stored under the keyValue from the dictionary to the updated fields.
            for n in range (1,len(sourceFieldsList)):
                updateRow[n] = valueDict[keyValue][n-1]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Example 3 - Transfer of Multiple Field Values between Feature Classes involving Complex Field Manipulations

 

If you need to do more complex field transfers that do more than a simple transfer of data from one table to another you can do this using code similar to the example field calculations that were done on a single table.  Here are the 3 field updates of my second field calculation example done between two separate tables with common join values instead of all within a single table.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField", "NumberField1", "NumberField2", "CapitalizedField" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "UrlField1", "SumField", "TitleCaseField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
             # updateRow[1] is the URL field and valueDict[keyValue][0] is the value field
             updateRow[1] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(valueDict[keyValue][0])

             # updateRow[2] is the Sumfield and valueDict[keyValue][1] and valueDict[keyValue][2] are the NumberField1 and NumberField2 fields. 
             updateRow[2] = valueDict[keyValue][1] + valueDict[keyValue][2]
             # updateRow[3] is the TitleCaseField and valueDict[keyValue][3] is the CapitalizedField values.
             updateRow[3

] = valueDict[keyValue][3

].title()

             updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Creating a Multi-Field Python Dictionary Key to Replace a Concatenated Join Field:

 

Handling Multi-field join values that match records between two feature classes is also possible using dictionaries and cursors without having to do concatenation of separate fields into a new field.  The concatenation is still necessary, but it can be handled on the fly in memory, which is much faster than calculating a field.  Below shows an example of the single field transfer, but this time matching records using 3 separate fields to define the unique values between the feature classes used to match the records between the feature classes.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField1", "JoinField2", "JoinField3", "ValueField"]

# Use list comprehension to build a dictionary from a da SearchCursor where the key values are based on 3 separate feilds
valueDict = {str(r[0]) + "," + str(r[1]) + "," + str(r[2]):(r[3:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField1", "JoinField2", "JoinField3", "ValueField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value by combining 3 field values of the row being updated in a keyValue variable
        keyValue = updateRow[0]+ "," + str(updateRow[1]) + "," + str(updateRow[2]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # transfer the value stored under the keyValue from the dictionary to the updated field.
            updateRow[3] = valueDict[keyValue][0]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))



‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Replacing Successive Joins with Simultaneous Processing of Several Python Dictionaries Built using da SearchCursors

 

When several joins have to be created and removed to do data manipulation between tables, the speed gains can be increased even more, since you can create several dictionaries and then process all of the join relationships simultaneously during a single pass of the update cursor on the updated feature class.  All source tables should be read into dictionaries before  processing the update cursor to achieve this benefit.  The more joins that you can replace this way, the more your script speed will improve.

 

Below is an example where two feature classes that would normally be joined to the updated feature class are replaced by dictionaries and the processed simultaneously by a single UpdateCursor operation.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass1"

sourceFieldsList = ["JoinField1", "ValueField", "NumberField1", "NumberField2" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict1 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

sourceFC = r"C:\Path\SourceFeatureClass2"

sourceFieldsList = ["JoinField2", "CapitalizedField" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict2 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"


updateFieldsList = ["JoinField1", "UrlField1", "SumField", "JoinField2", 
"TitleCaseField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        keyValue1 = updateRow[0]
        # verify that the keyValue1 is in the First Dictionary
        if keyValue1 in valueDict1:
             # updateRow[1] is the URL field and valueDict1[keyValue1][0] is the value field
             updateRow[1] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(valueDict1[keyValue1][0])

             # updateRow[2] is the Sumfield and valueDict1[keyValue1][1] and valueDict1[keyValue1][2] are the NumberField1 and NumberField2 fields. 
             updateRow[2] = valueDict1[keyValue1][1] + valueDict1[keyValue1][2]
        keyValue2 = updateRow[3]
        # verify that the keyValue2 is in the Second Dictionary
        if keyValue2 in valueDict2:

             # updateRow[4] is the TitleCaseField and valueDict2[keyValue2][0] is the CapitalizedField values.

             updateRow[4

] = valueDict2[keyValue2

][0].title()

        updateRows.updateRow(updateRow)

del valueDict1
del valueDict2

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a Python Dictionary Built using a da SearchCursor to Replace a Summary Statistics Output Table

 

Another common need is to create a summary of a feature class for the unique values of a field or fields that cover multiple records.  The Python Dictionary can be used instead of a Summary Table output to accomplish this.  The benefit is that the output is stored in memory and is much faster to create than outputting an actual summary table.  Below is an example building a summary dictionary and using it to update the source table with sum values matching that field and create a percentage field in the source table for each record that built the summary.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\UpdateFeatureClass"

sourceFieldsList = ["UniqueValuesField", "NumberField"]

# Build a summary dictionary from a da SearchCursor with unique key values of a field storing a list of the sum of that value and the record count.
valueDict = {}
with arcpy.da.SearchCursor(sourceFC, sourceFieldsList) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if not keyValue in valueDict:
             # assign a new keyValue entry to the dictionary storing a list of the first NumberField value and 1 for the first record counter value
            valueDict[keyValue] = [searchRow[1], 1]
        # Sum the last summary of NumberField value with the current record and increment the record count when keyvalue is already in the dictionary  
        else:
            valueDict[keyValue][0] += searchRow[1]

            valueDict[keyValue][1] += 1  


updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["UniqueValuesField", "NumberField", "PercentField", "NumberSumField", "MeanField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # divide the current record's NumberField value by the sum of the NumberField to get a percentage value

            updateRow[2] = updateRow[1] / valueDict[keyValue][0] 
            # transfer the sum of the NumberField stored under the keyValue from the dictionary.
            updateRow[3] = valueDict[keyValue][0]
            # divide the sum of the NumberField value by the record count of the NumberField to get a mean value

            updateRow[4] = valueDict[keyValue][0] / valueDict[keyValue][1]

            updateRows.updateRow(updateRow)


del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))


‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Conclusion

 

Once the patterns are understood it is possible to do virtually everything that you can do with the Field Calculator, a Join or a Summary table using an update cursor or a python dictionary.   Using these coding techniques I have been able to rewrite several scripts that did bulk Data Manipulations so that rather than taking more than an hour each to process, they now only take 3 to 10 minutes to process.  The more of these operations that you can replace using code like this, the greater the savings in processing time you will achieve.  The performance gains can be huge making it possible to keep data up to date without extensive waits or consuming large portions of your after hours batch processing time.  It also solves the problems that arise when you cannot alter the source data schema to add indexes or field concatenations and making copies of the data is not practical.

 

if you find any errors in the sample code, please let me know and I will post an update to this Blog to point out the correction. Also if you need more comments added to the code to help you understand what is going on, let me know and I will try to comment the code more fully.  These code examples are designed to work on shapefiles, file geodatabase feature classes, and unversioned SDE feature classes.  Additional code modifications are required to make them work with versioned SDE feature classes.

 

I want to express my thanks to Chris Synder who first demonstrated the usefulness of these techniques to me through the Forum and helped me understand how to apply them.

59 Comments
Labels