Select to view content in your preferred language

Turbo Charging Data Manipulation with Python Cursors and Dictionaries

54306
91
11-08-2014 02:16 AM
RichardFairhurst
MVP Alum
48 91 54.3K

The Problems with Using the Field Calculator, Join and Summary Statistics tools

 

The field calculator is a great tool and easy to set up, but it has several limitations that can seriously slow down scripted routines when you need to do a lot of data manipulation with large data sets.  The most obvious limitation is that only one field can be calculated at a time.  This often means having to process every record in the feature class multiple times in order to update several fields.  If you need to do data transfers between feature classes using a join it is critical to index the join field in both feature classes to avoid calculations that take forever to process, and even then the speed is significantly slower when a join is involved.  Also, if you actually need to use multiple fields to create a join that correctly matches records between feature classes your only option is to create a new field in both feature classes that concatenates those field values to make a join that works.  Also, often before these joins can be created a summary table has to be created, which takes more time.

 

Fortunately if you have ArcGIS Desktop 10.1 or later there is an alternative that can speed scripts like this up dramatically and make data transfers between feature classes both flexible and incredibly fast.  The solution is to replace the field calculator with one or more data access cursors (da cursors) and joins and creating summary tables with one or more python dictionaries built using a da SearchCursor.

 

Using a da UpdateCursor to Replace a Field Calculation within a Single Feature Class:

 

Below is a simple script that will use a da UpdateCursor on a feature class to fill in a URL field with a hyperlink that is built using a value taken from another field.  The strftime method is used to print out the time when the script starts and the time when it finishes.  To adapt this script to your own data you only need to change the Feature Class path and name in line 7, the field names in the field list in line 9 and the URL string in line 14.  This should perform about 7 times faster than a similar field calculation.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

updateFC = r"C:\Path\FeatureClass"

updateFieldsList =  ["URLField", "ValueField"] # Example fields that will build a URL based on another field

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # updateRow[0] is the URL field and updateRow[1] is the value field

        updateRow[0] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(updateRow[1])

        updateRows.updateRow(updateRow)

print( "Finished APN_MAP field update: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a da UpdateCursor to Replace Several Field Calculations within a Single Feature Class:

 

Now lets say that we actually need to process 2 more field calculations to complete the update of this feature class.  One calculation needs to add two other fields together, and the other needs to convert the values in a field from capitalized values to title case values.  So if there are 100K records, then all of those records have to be calculated 3 times.

 

All of this can be handled in a single pass of one da UpdateCursor through the 100K records.  Because the records only have to changed once and the addition of more fields for the UpdateCursor to read or calculate takes very little additional time to process, the speed gain over 3 field calculations is much faster that the previous script.  As a result, the more field calculations you can replace with a single pass of a da UpdateCursor, the more dramatically the speed of the script will improve.

 

Below is the script that can do all three field updates (assume that there are no Null values in the fields being added together or changing case).  To use this script with your own data you still only need to change lines 7, 9, and 14 (optionally change the calculations in lines 16 and 18).

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

updateFC = r"C:\Path\FeatureClass"

updateFieldsList =  ["URLField", "ValueField", "SumField", "NumberField1", "NumberField2", "TitleCaseField"] # Example fields that will build a URL based on another field

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # updateRow[0] is the URL field and updateRow[1] is the value field

        updateRow[0] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(updateRow[1])


         # updateRow[2] is the Sumfield and updateRow[3] and updateRow[4] are the NumberField1 and NumberField2 fields. 
        updateRow[2] = updateRow[3] + updateRow[4]
         # updateRow[5] is the TitleCaseField that is having its Capitalized values replaced with Title Case values.
        updateRow[
5
] = updateRow[
5
].title()


        updateRows.updateRow(updateRow)


print( "Finished APN_MAP field update: " + strftime("%Y-%m-%d %H:%M:%S"))
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a Python Dictionary Built using a da SearchCursor to Replace a Join Connecting Two Feature Classes:

 

Another common data manipulation involves joining one feature class to another in order to transfer values from the Joined feature class to the Target feature class using the Field Calculator.  However, features loaded into a Python Dictionary using a Search Cursor are much faster at matching join field values in one feature class to another than a Join.  Combined with using a da UpdateCursor to replace the Field Calculator, the speed of these kinds of data manipulations can be even more dramatic than data manipulations on a single feature class.

 

Example 1 - Transfer of a Single Field Value between Feature Classes

 

Below is the code required to transfer the value of one feature class to another based on a common join value in another field.  To adapt this script to your own data change lines 7 and 14 to the path and name of the source (Join) and update (target) Feature Class to match the feature classes you would normally join, change lines 9 and 16 to replace the field list with the name of the Join field and the name of the field being transferred.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField"]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "ValueField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable

        keyValue = updateRow[0]

         # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
             # transfer the value stored under the keyValue from the dictionary to the updated field.
            updateRow[1] = valueDict[keyValue][0]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))


‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Example 2 - Transfer of Multiple Field Values between Feature Classes where there is a 1:1 Match between Field Sets

 

Just like the Field Calculator the more fields being transferred from one feature class to the other, the more your script will speed up.  If the transfer is a simple field for field transfer then the transfer fields can be handled in a for loop.  Below is an example of a transfer of 5 matching fields between two feature classes.  To adapt this code to your data you should again change the feature classes in lines 7 and 14 and the field lists in lines 9 and 16.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField1", "ValueField2", "ValueField3", "ValueField4", "ValueField5" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "ValueField1", "ValueField2", "ValueField3", "ValueField4", "ValueField5"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # transfer the values stored under the keyValue from the dictionary to the updated fields.
            for n in range (1,len(sourceFieldsList)):
                updateRow[n] = valueDict[keyValue][n-1]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Example 3 - Transfer of Multiple Field Values between Feature Classes involving Complex Field Manipulations

 

If you need to do more complex field transfers that do more than a simple transfer of data from one table to another you can do this using code similar to the example field calculations that were done on a single table.  Here are the 3 field updates of my second field calculation example done between two separate tables with common join values instead of all within a single table.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField", "ValueField", "NumberField1", "NumberField2", "CapitalizedField" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField", "UrlField1", "SumField", "TitleCaseField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
             # updateRow[1] is the URL field and valueDict[keyValue][0] is the value field
             updateRow[1] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(valueDict[keyValue][0])

             # updateRow[2] is the Sumfield and valueDict[keyValue][1] and valueDict[keyValue][2] are the NumberField1 and NumberField2 fields. 
             updateRow[2] = valueDict[keyValue][1] + valueDict[keyValue][2]
             # updateRow[3] is the TitleCaseField and valueDict[keyValue][3] is the CapitalizedField values.
             updateRow[3

] = valueDict[keyValue][3

].title()

             updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Creating a Multi-Field Python Dictionary Key to Replace a Concatenated Join Field:

 

Handling Multi-field join values that match records between two feature classes is also possible using dictionaries and cursors without having to do concatenation of separate fields into a new field.  The concatenation is still necessary, but it can be handled on the fly in memory, which is much faster than calculating a field.  Below shows an example of the single field transfer, but this time matching records using 3 separate fields to define the unique values between the feature classes used to match the records between the feature classes.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass"

sourceFieldsList = ["JoinField1", "JoinField2", "JoinField3", "ValueField"]

# Use list comprehension to build a dictionary from a da SearchCursor where the key values are based on 3 separate feilds
valueDict = {str(r[0]) + "," + str(r[1]) + "," + str(r[2]):(r[3:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["JoinField1", "JoinField2", "JoinField3", "ValueField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value by combining 3 field values of the row being updated in a keyValue variable
        keyValue = updateRow[0]+ "," + str(updateRow[1]) + "," + str(updateRow[2]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # transfer the value stored under the keyValue from the dictionary to the updated field.
            updateRow[3] = valueDict[keyValue][0]
            updateRows.updateRow(updateRow)

del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))



‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Replacing Successive Joins with Simultaneous Processing of Several Python Dictionaries Built using da SearchCursors

 

When several joins have to be created and removed to do data manipulation between tables, the speed gains can be increased even more, since you can create several dictionaries and then process all of the join relationships simultaneously during a single pass of the update cursor on the updated feature class.  All source tables should be read into dictionaries before  processing the update cursor to achieve this benefit.  The more joins that you can replace this way, the more your script speed will improve.

 

Below is an example where two feature classes that would normally be joined to the updated feature class are replaced by dictionaries and the processed simultaneously by a single UpdateCursor operation.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\SourceFeatureClass1"

sourceFieldsList = ["JoinField1", "ValueField", "NumberField1", "NumberField2" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict1 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

sourceFC = r"C:\Path\SourceFeatureClass2"

sourceFieldsList = ["JoinField2", "CapitalizedField" ]

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict2 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Path\UpdateFeatureClass"


updateFieldsList = ["JoinField1", "UrlField1", "SumField", "JoinField2", 
"TitleCaseField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        keyValue1 = updateRow[0]
        # verify that the keyValue1 is in the First Dictionary
        if keyValue1 in valueDict1:
             # updateRow[1] is the URL field and valueDict1[keyValue1][0] is the value field
             updateRow[1] = r"
http://pic.asrclkrec.com/view_ParcelMaps.aspx?ParcelNumber=
" + str(valueDict1[keyValue1][0])

             # updateRow[2] is the Sumfield and valueDict1[keyValue1][1] and valueDict1[keyValue1][2] are the NumberField1 and NumberField2 fields. 
             updateRow[2] = valueDict1[keyValue1][1] + valueDict1[keyValue1][2]
        keyValue2 = updateRow[3]
        # verify that the keyValue2 is in the Second Dictionary
        if keyValue2 in valueDict2:

             # updateRow[4] is the TitleCaseField and valueDict2[keyValue2][0] is the CapitalizedField values.

             updateRow[4

] = valueDict2[keyValue2

][0].title()

        updateRows.updateRow(updateRow)

del valueDict1
del valueDict2

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))

‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Using a Python Dictionary Built using a da SearchCursor to Replace a Summary Statistics Output Table

 

Another common need is to create a summary of a feature class for the unique values of a field or fields that cover multiple records.  The Python Dictionary can be used instead of a Summary Table output to accomplish this.  The benefit is that the output is stored in memory and is much faster to create than outputting an actual summary table.  Below is an example building a summary dictionary and using it to update the source table with sum values matching that field and create a percentage field in the source table for each record that built the summary.

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\UpdateFeatureClass"

sourceFieldsList = ["UniqueValuesField", "NumberField"]

# Build a summary dictionary from a da SearchCursor with unique key values of a field storing a list of the sum of that value and the record count.
valueDict = {}
with arcpy.da.SearchCursor(sourceFC, sourceFieldsList) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if not keyValue in valueDict:
             # assign a new keyValue entry to the dictionary storing a list of the first NumberField value and 1 for the first record counter value
            valueDict[keyValue] = [searchRow[1], 1]
        # Sum the last summary of NumberField value with the current record and increment the record count when keyvalue is already in the dictionary  
        else:
            valueDict[keyValue][0] += searchRow[1]

            valueDict[keyValue][1] += 1  


updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["UniqueValuesField", "NumberField", "PercentField", "NumberSumField", "MeanField"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            # divide the current record's NumberField value by the sum of the NumberField to get a percentage value

            updateRow[2] = updateRow[1] / valueDict[keyValue][0] 
            # transfer the sum of the NumberField stored under the keyValue from the dictionary.
            updateRow[3] = valueDict[keyValue][0]
            # divide the sum of the NumberField value by the record count of the NumberField to get a mean value

            updateRow[4] = valueDict[keyValue][0] / valueDict[keyValue][1]

            updateRows.updateRow(updateRow)


del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))


‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

 

Conclusion

 

Once the patterns are understood it is possible to do virtually everything that you can do with the Field Calculator, a Join or a Summary table using an update cursor or a python dictionary.   Using these coding techniques I have been able to rewrite several scripts that did bulk Data Manipulations so that rather than taking more than an hour each to process, they now only take 3 to 10 minutes to process.  The more of these operations that you can replace using code like this, the greater the savings in processing time you will achieve.  The performance gains can be huge making it possible to keep data up to date without extensive waits or consuming large portions of your after hours batch processing time.  It also solves the problems that arise when you cannot alter the source data schema to add indexes or field concatenations and making copies of the data is not practical.

 

if you find any errors in the sample code, please let me know and I will post an update to this Blog to point out the correction. Also if you need more comments added to the code to help you understand what is going on, let me know and I will try to comment the code more fully.  These code examples are designed to work on shapefiles, file geodatabase feature classes, and unversioned SDE feature classes.  Additional code modifications are required to make them work with versioned SDE feature classes.

 

I want to express my thanks to Chris Synder who first demonstrated the usefulness of these techniques to me through the Forum and helped me understand how to apply them.

91 Comments
DanPatterson_Retired
MVP Emeritus

Nice job Richard...bookmarked for my classes

RichardFairhurst
MVP Alum

I found one mistake in my code examples.  Where ever I tried to convert a string to Title Case the syntax should not have been:

updateRow[5] = Title(updateRow[5])

It should have been:

updateRow[5] = updateRow[5].title()

Note:  Original Blog post has been corrected.

HunterWest
Occasional Contributor

Thanks for this, Richard - very helpful.

Would you mind breaking down the logic in this line for me? I would like to know what each of the components is doing:

valueDict1 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)} 

RichardFairhurst
MVP Alum

Hunter:

That is a good question, and trying to explain that line of code will probably help me understand and use it to the best advantage in the future as well.

I will try to break it apart from left to right.

valueDict1 = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

1.  Assignment to the final dictionary variable (this should be obvious, but it is definitely important)

valueDict1 =

2.  The opening curly bracket (and the closing curly bracket at the very end) makes everything inside part of a dictionary.  Everything else inside the curly brackets ultimately is designed to extract the Key/Value pairs read from the table that will be used by later dictionary look-up processes.

{ ... }

3.  r stands for a Row read from a cursor opened on your table.  r[0] reads the first (zero based) element of the row, which is actually the value from the first field in the each row that is being read.  The colon following r[0] means that this value will be the dictionary key used by later dictionary look-ups.

r[0]:

4: Following the colon is the dictionary Value associated with the dictionary key.  This is defined by everything falling inside the pair of parentheses, which is a tuple, or collection of values that can be treated and passed as a single value.  Dictionary key/value pairs are limited to using a single value after the colon, but that includes collections like a tuple or a list as long as they can be passed as one value.

( ... )

5.  The r[1:] inside the parenthesis are part of a list comprehension that extracts all of the field values in the rest of the cursor fields being read as the separate items within the tuple or list collection.  The [1:] means start with the value of the second field in the field list and continue appending values from every field that follows until the last field in the field list is read.

6  The "for r in" is part of a list comprehension that iterates through each row in a cursor.

for r in

7.  The last part of the list comprehension occurs before the end curly bracket of the dictionary.  This code opens a search cursor on the specified table (sourceFC) which will be used to read the rows from the table.  The cursor will be read one row at a time by the iterator in item 6 above and each row will contain the values of the fields specified in the provided field list (sourceFieldsList).

arcpy.da.SearchCursor(sourceFC, sourceFieldsList)

The structure above is good where every row will result in a dictionary key that is always unique.  When more than one row can contain the same dictionary key value, usually you need to use a structure to read a cursor into a dictionary similar to the one shown below (there are many possible variations on this pattern, so this is just one example):

  1. # Build a summary dictionary from a da SearchCursor with unique key values of a field storing a list of the sum of that value and the record count. 
  2. valueDict = {}
  3. with arcpy.da.SearchCursor(sourceFC, sourceFieldsList) as searchRows:
  4.     for searchRow in searchRows:
  5.         keyValue = searchRow[0]
  6.         if not keyValue in valueDict:
  7.              # assign a new keyValue entry to the dictionary storing a list of the first NumberField value and 1 for the first record counter value
  8.             valueDict[keyValue] = [searchRow[1], 1]
  9.         # Sum the last summary of NumberField value with the current record and increment the record count when keyvalue is already in the dictionary 
  10.         else:
  11.             valueDict[keyValue][0] += searchRow[1]
  12.             valueDict[keyValue][1] += 1
HunterWest
Occasional Contributor

Great explanation - thanks Richard!

MatthewPeters
Occasional Contributor

This information has been very helpful, Thanks Richard.

I replaced a Join Field process that was taking ~40 minutes with arcpy.da.TableToNumPyArray and arcpy.da.ExtendTable which now completes in 6 seconds.

I wonder why the standard tools aren't implemented with this faster performing module?

ChristopherBevilacqua
Emerging Contributor

This has been one of the most useful Python web pages I have ever found.  The techniques demonstrated here save time by not only running faster than the alternatives but also by cutting way back on the amount of code I need to write.  Thanks!

RichardFairhurst
MVP Alum

Christopher:

Thanks for taking the time to share your appreciation for this technique.  The examples in this Blog are intentionally kept simple to emphasize the essentials of the technique.

For a more advanced example of how the technique can be applied check out the code I developed in response to arcpy - How to improve performance of nested search cursors? - Geographic Information Systems Stack ....  The code in that post demonstrates:

1) how to use dictionaries when you are working with a set of tables that form a multilevel relational tree,

2) how to use the dictionary(list) method to efficiently populate a dictionary key with a list of lists dictionary value that will hold multiple data records associated with the key,

3) how to access the list of lists values when processing the dictionary keys for matches, and

4) how to create and use a tuple that will act as a multi-field composite dictionary key.

DougMerrick
Occasional Contributor

I am having trouble  using the example for "Creating a Multi-Field Python Dictionary Key to Replace a Concatenated Join Field:"

I have replaced with source/update feature classes and field but when I try to run the script I get the error message:

Parsing error SyntaxError: invalid syntax (line 23)

This is line 

        if keyValue in valueDict:  

I am unable to discern what the problem is with the line of code.

from time import strftime

print "Start script: " + strftime("%Y-%m-%d %H:%M:%S")

import arcpy

sourceFC = r"C:\Users\user\Desktop\test.gdb\Repl_Cost"

sourceFieldsList = ["WPP_TYPE", "MATERIAL", "DIAMETER", "Unit_Rate"]

# Use list comprehension to build a dictionary from a da SearchCursor where the key values are based on 3 separate feilds
valueDict = {str(r[0]) + "," + str(r[1]) + "," + str(r[2]):(r[3:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = r"C:\Users\user\Desktop\test.gdb\WPP"

updateFieldsList = ["Asset_Type", "EXTG_Material", "Diameter", "Repl_Cost"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
for updateRow in updateRows:
# store the Join value by combining 3 field values of the row being updated in a keyValue variable
keyValue = updateRow[0]+ "," + str(updateRow[1]) + "," + str(updateRow[2]
# verify that the keyValue is in the Dictionary
if keyValue in valueDict:
# transfer the value stored under the keyValue from the dictionary to the updated field.
updateRow[3] = valueDict[keyValue][0]
updateRows.updateRow(updateRow)

del valueDict

print "Finished script: " +strftime("%Y-%m-%d %H:%M:%S")

NeilAyres
MVP Alum

Firstly, it would help if you format the code correctly so it is more readable.

/blogs/dan_patterson/2016/08/14/script-formatting 

NeilAyres
MVP Alum

But i would think that the error is here :

valueDict = {str(r[0]) + "," + str(r[1]) + "," + str(r[2]):(r[3:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

That doesn't look like it is going to form a dictionary inside the {}.

Dictionaries are populated with key : value pairs.

Hang on, looking at that again...

But why not do this in 2 steps, use the OID as the key, put all the other data in a value list.

Then calculate a new concatenated value.

ChristopherBevilacqua
Emerging Contributor

I'm not sure if this is your problem or not, but you are missing a closing parenthesis at the end of this line:

keyValue = updateRow[0]+ "," + str(updateRow[1]) + "," + str(updateRow[2]

RichardFairhurst
MVP Alum

Try making the first field a string like you do when you are building the dictionary.  So change:

keyValue = updateRow[0] + "," + str(updateRow[1]) + "," + str(updateRow[2]‍‍‍

to

keyValue = str(updateRow[0]) + "," + str(updateRow[1]) + "," + str(updateRow[2])

 Also add the end parenthesis on that line for the last str() method.

DougMerrick
Occasional Contributor

Thanks for catching that Richard, I am amazed at how fast the python code works.

RichardFairhurst
MVP Alum

Glad to hear you got it working. Now you know why I put "Turbo Charging" in the Blog title.

RyanMonk1
Deactivated User

Can Esri comment if they are working on improving the performance of the Join tool?

https://community.esri.com/ideas/8679 

DavidBrett
Occasional Contributor

This is great...I found this blog post from this thread: arcpy.da.insertcursor slow performance and stopiteration error 

Using arcpy.da.InsertCursor to insert entire row that is fetched from search cursor? - Geographic In... 

Any tips on using the insert cursor method with a dictionary without hard coding the field names?

RichardFairhurst
MVP Alum

I am not sure that a dictionary will improve the performance of code from the second post you referenced at all.  In that post you are not trying to match an existing row.  You're just transferring the data in every row without matching anything.  So the random access capability of the dictionary is not being used and you can't eliminate any of the steps from the suggested code by incorporating a dictionary.

I would agree with the comment that says you should avoid inserting to the geodatabase OID, area, and length fields.  I would also not insert to a GlobalID field if you have one.  There also a check for whether or not a field is editable.  If SDE is involved more code to start and stop an edit session is required, and performance will drop.

So what is the end goal of your question?  A tool?  If so, check out my alternative to the Append tool that is included in my Multi-Field to Single Field tool in this Blog.  It deals with the non-editable fields issue and allows records to be transferred between two stand-alone tables, from a feature class to a standalone table, and between features classes with the same geometry type.  However, I can't recall if I made it compatible with SDE or not.  I think I did, but if not let me know.  I have never tested the code with a geometric network.  I primarily optimized them for simple feature classes and file geodatabases.

HillaryBjorstrom
Frequent Contributor

This works great however i have an issue with too many edits being made on a versioned database and it fails after a while. i have tried adding autocommit however it is not working in the loop. I am also not sure i wrote that correct.

import arceditor
import arcpy
from time import strftime

arcpy.env.workspace ="E:\\arcgisserver\\Resources\\Connections\\GIS_SCRIPTINGVS.sde"


sourceFC="E:\\arcgisserver\\Resources\\ScriptingDB.gdb\\BISRV_BIPERCON"


print "Start script: " + strftime("%Y-%m-%d %H:%M:%S")


sourceFieldsList = ["BI_SRV_MAP_LOC","SRVADDR1","SRVADDR2","SRVST","SRVCITY","SRVZIP","BUSINESS","HOME","BICUST","MOBILE","BI_FORMAT_NAME","BI_FORMAT_ADDL_NAME","BI_KEY_CUST_CD","BI_ROUTE_CD","BI_OUT_PRI_CD","BI_MED_NEC_CD","BI_SRV_LOC","BI_AR_STAT","BI_ACCT", "BI_MTR_PHS", "BI_TRF_NBR", "BI_MTR_NBR"]

edit = arcpy.da.Editor(arcpy.env.workspace)
edit.startEditing(True, True)
edit.startOperation()

# Use list comprehension to build a dictionary from a da SearchCursor
valueDict = {r[0]:(r[1:]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = "E:\\arcgisserver\\Resources\\Connections\\GIS_SCRIPTINGVS.sde\\GIS.DBO.ELECTRIC\\GIS.DBO.ServicePoint"

updateFieldsList = ["SERVICE_MAP_LOCATION","SERVICE_ADDRESS","SERVICE_ADDRESS2","STATE","CITY","Zip","BusinessPhone","PrimaryPhone","CustID","MobilePhone","NAME","MISC_CHAR_1","KeyAccount","MISC_CHAR_2","OutagePriority","MEDALERT","Service_Number","SERVICE_STATUS","Account_Number", "CISPhase","TRANSFORMER_ID","MeterNumber"]


arcpy.env.autoCommit = ""


with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
for updateRow in updateRows:
# store the Join value of the row being updated in a keyValue variable
keyValue = updateRow[0]
# verify that the keyValue is in the Dictionary
print keyValue
if keyValue in valueDict:
# transfer the values stored under the keyValue from the dictionary to the updated fields.
for n in range (1,len(sourceFieldsList)):
updateRow = valueDict[keyValue][n-1]
updateRows.updateRow(updateRow)

del valueDict
arcpy.DeleteManagement("test")
edit.stopOperation()
print "edits complete"


# Stop the edit session and save the changes
edit.stopEditing(True)
print "Finished script: " + strftime("%Y-%m-%d %H:%M:%S")
print "Finished script: " + strftime("%Y-%m-%d %H:%M:%S")

Any guidance would be greatly appreciated thanks!

RichardFairhurst
MVP Alum

Hillary:

You have written your code to write to every record, even if nothing actually changed.  To only write to records where the value has changed involves adding an if clause to compare the records and only write to those that are different.  The simple examples provided here are only meant to show the core concept, not illustrate all of the variety of effects that can be achieved by adjusting the program flow and logic.

Here is a portion of a program I wrote for my own needs that will only update records in the update table when they do not match the source table.  The dictionary built in this example trims white-space characters from the source table, since the update table has no white-space padding, but the source table has white-space padding:

import python

# subroutine for trimming whitespace from the right side of a text field value.
def stringCaseTrim(value):
    tempstr = value
    return tempstr.rstrip()

GDB_PUB_TLMA_PARCELS_ASSESSOR = r"C:\Users\RFAIRHUR\AppData\Roaming\ESRI\Desktop10.3\ArcCatalog\GDB_PUB.sde\GDB_PUB.RIVCO.PARCELS_ASSESSOR"
PARCELS_ASSESSOR_Layer = "Parcel_Assessors_Layer"

# Process: Make Feature Layer (2)
arcpy.MakeFeatureLayer_management(GDB_PUB_TLMA_PARCELS_ASSESSOR, PARCELS_ASSESSOR_Layer, "APN < 'RIVER' AND MAIL_TO_ZIP > -1 AND ZIP > -1", "", "OBJECTID OBJECTID VISIBLE NONE;APN APN VISIBLE NONE;FLAG FLAG VISIBLE NONE;MAIL_TO_NAME MAIL_TO_NAME VISIBLE NONE;MAIL_TO_STREET MAIL_TO_STREET VISIBLE NONE;MAIL_TO_CITY MAIL_TO_CITY VISIBLE NONE;MAIL_TO_ZIP MAIL_TO_ZIP VISIBLE NONE;HOUSE_NO HOUSE_NO VISIBLE NONE;DIR DIR VISIBLE NONE;STREET STREET VISIBLE NONE;SUFFIX SUFFIX VISIBLE NONE;CITY CITY VISIBLE NONE;ZIP ZIP VISIBLE NONE;HOUSE_SUFF HOUSE_SUFF VISIBLE NONE;UNIT UNIT VISIBLE NONE;REALUSE REALUSE VISIBLE NONE;PRIMARY_CODE PRIMARY_CODE VISIBLE NONE;SECONDARY_CODE SECONDARY_CODE VISIBLE NONE;MULTI_CODE MULTI_CODE VISIBLE NONE;TRACT TRACT VISIBLE NONE;ACRE ACRE VISIBLE NONE;RECORDERS_TYPE RECORDERS_TYPE VISIBLE NONE;RECORD_BOOK RECORD_BOOK VISIBLE NONE;RECORD_PAGE RECORD_PAGE VISIBLE NONE;CNTY_CODE CNTY_CODE VISIBLE NONE;LOT_TYPE LOT_TYPE VISIBLE NONE;LOT LOT VISIBLE NONE;BLOCK_NO BLOCK_NO VISIBLE NONE;SUB_NAME SUB_NAME VISIBLE NONE;CAME_FROM CAME_FROM VISIBLE NONE;TAXABILITY TAXABILITY VISIBLE NONE;TRA TRA VISIBLE NONE;LAND LAND VISIBLE NONE;STRUCTURE STRUCTURE VISIBLE NONE;OWNER1_LAST_NAME OWNER1_LAST_NAME VISIBLE NONE;OWNER1_FIRST_NAME OWNER1_FIRST_NAME VISIBLE NONE;OWNER1_MID_NAME OWNER1_MID_NAME VISIBLE NONE;OWNER2_LAST_NAME OWNER2_LAST_NAME VISIBLE NONE;OWNER2_FIRST_NAME OWNER2_FIRST_NAME VISIBLE NONE;OWNER2_MID_NAME OWNER2_MID_NAME VISIBLE NONE;OWNER3_LAST_NAME OWNER3_LAST_NAME VISIBLE NONE;OWNER3_FIRST_NAME OWNER3_FIRST_NAME VISIBLE NONE;OWNER3_MID_NAME OWNER3_MID_NAME VISIBLE NONE;SHAPE SHAPE VISIBLE NONE;SHAPE.STArea() SHAPE.STArea() VISIBLE NONE;SHAPE.STLength() SHAPE.STLength() VISIBLE NONE")

sourceFC = PARCELS_ASSESSOR_Layer

sourceFieldsList = ['APN','MAIL_TO_NAME','MAIL_TO_STREET','MAIL_TO_CITY','MAIL_TO_ZIP','HOUSE_NO','DIR','STREET','SUFFIX','CITY','ZIP','HOUSE_SUFF','UNIT','OWNER1_LAST_NAME','OWNER1_FIRST_NAME','OWNER1_MID_NAME','OWNER2_LAST_NAME','OWNER2_FIRST_NAME','OWNER2_MID_NAME','OWNER3_LAST_NAME','OWNER3_FIRST_NAME','OWNER3_MID_NAME']

# Use list comprehension that includes whitespace trimmming to build a dictionary from a da SearchCursor
valueDict = {r[0]:([stringCaseTrim(value) if str(type(value)) in ("<class 'str'>", "<type 'unicode'>") else value for value in r[1:]]) for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}  

PARCELS_ASSESSOR_OWNER1 = r"\\agency\agencydfs\Tran\FILES\GISData\rfairhur\Layers\Plus_Conversion_Data\Plus_Conversion_Data.gdb\PARCELS_ASSESSOR_OWNER1"

updateFC = PARCELS_ASSESSOR_OWNER1

updateFieldsList = ['APN','MAIL_TO_NAME','MAIL_TO_STREET','MAIL_TO_CITY','MAIL_TO_ZIP','HOUSE_NO','DIR','STREET','SUFFIX','PARCEL_CITY','ZIP','HOUSE_SUFF','UNIT','OWNER1_LAST_NAME','OWNER1_FIRST_NAME','OWNER1_MID_NAME','OWNER2_LAST_NAME','OWNER2_FIRST_NAME','OWNER2_MID_NAME','OWNER3_LAST_NAME','OWNER3_FIRST_NAME','OWNER3_MID_NAME','FREQUENCY','SUM_SHAPE_Area','pa_key','pa_key_1','pa_key_2','pa_key_3','pa_address_1','pa_address_2','pa_address_3','pa_zip','pa_notation','RELATION','CREATED','EFFECTIVE_FROM','EFFECTIVE_TO','DELETED','ORIG_FULL_CONTACT_ID','ORIG_ADDRESS_ID','CONTACT_ID','lms_key','IS_ACTIVE','ISBILLING','Temp_Contact_Name','TEMP_CONTACT_NAME_ID','ISCOMPANY','ISCONTACT','GLOBALENTITYNAME','FIRSTNAME','MIDDLENAME','LASTNAME','ADDRESSTYPE','COUNTRYTYPE','ADDRESSLINE1','PREDIRECTION','ADDRESSLINE2','STREETTYPE','POSTDIRECTION','UNITORSUITE','ADDRESSLINE3','CITY','STATE','POSTALCODE','COUNTY','COUNTRY','PROVINCE','POBOX','COMPSITE','RURALROUTE','ATTN','ISACTIVE','ZIP_1','PO_NAME', 'STATE_1']

changeCnt = 0

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:  
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable  
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary  
        if keyValue in valueDict:  
            # compare record in the dictionary to the existing update table record
            if list(valueDict[keyValue]) <> updateRow[1:22]:
                # A change has been made to the mailing address, owner names or both
                changeCnt += 1
                for n in range (1,len(sourceFieldsList)):  
                    updateRow[n] = valueDict[keyValue][n-1]
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍                updateRows.updateRow(updateRow)  
‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍
HillaryBjorstrom
Frequent Contributor

Thank you so much this is exactly what i am looking for. i will give it a try.

Happy holidays!

HillaryBjorstrom
Frequent Contributor

can you explain the apnDict. 

RichardFairhurst
MVP Alum

I have removed that line.  This code is a revised extract from a much longer script that had several other subroutines.  I missed that line and should have removed it before posting the code, since it is not relevant to the code you need.

HillaryBjorstrom
Frequent Contributor

Thanks the script will work but it still runs through each record and makes a change.  I feel so close. i see the keyvalue (updated feature) is being compared in the valueDict (source feature). I feel it is around this section i'm stuck 

if list(valueDict[keyValue]) <> updateRow[1:22]:
# A change has been made to the mailing address, owner names or both
changeCnt += 1

Thank you again.

Here is what i have.

import arceditor
import arcpy
from time import strftime

workspace ="E:\\arcgisserver\\Resources\\Connections\\GIS_SCRIPTINGVS.sde"


serviceFC = arcpy.MakeFeatureLayer_management("E:\\arcgisserver\\Resources\\Connections\\GIS_SCRIPTINGVS.sde\\.DBO.ELECTRIC\\.DBO.ServicePoint", "test","LINE_SECTION = 'PRIUG6819'","","OBJECTID OBJECTID VISIBLE NONE;AncillaryRole AncillaryRole VISIBLE NONE;Enabled Enabled VISIBLE NONE;SubtypeCD SubtypeCD VISIBLE NONE;Account_Number Account_Number VISIBLE NONE;SERVICE_MAP_LOCATION SERVICE_MAP_LOCATION VISIBLE NONE;MeterNumber MeterNumber VISIBLE NONE;TRANSFORMER_ID TRANSFORMER_ID VISIBLE NONE;MAP_NO MAP_NO VISIBLE NONE;LINE_SECTION LINE_SECTION VISIBLE NONE;SUBSTATION SUBSTATION VISIBLE NONE;FEEDER_NUMBER FEEDER_NUMBER VISIBLE NONE;SERVICE_STATUS SERVICE_STATUS VISIBLE NONE;Service_Number Service_Number VISIBLE NONE;CC_Number CC_Number VISIBLE NONE;NAME NAME VISIBLE NONE;SERVICE_ADDRESS SERVICE_ADDRESS VISIBLE NONE;SERVICE_ADDRESS2 SERVICE_ADDRESS2 VISIBLE NONE;CITY CITY VISIBLE NONE;STATE STATE VISIBLE NONE;DESCRIPTION DESCRIPTION VISIBLE NONE;DateModified DateModified VISIBLE NONE;InstallationDate InstallationDate VISIBLE NONE;Comments Comments VISIBLE NONE;ServiceCurrentRating ServiceCurrentRating VISIBLE NONE;LoadManagement LoadManagement VISIBLE NONE;MISC_CHAR_1 MISC_CHAR_1 VISIBLE NONE;MISC_CHAR_2 MISC_CHAR_2 VISIBLE NONE;OutagePriority OutagePriority VISIBLE NONE;Zip Zip VISIBLE NONE;Detail Detail VISIBLE NONE;KeyAccount KeyAccount VISIBLE NONE;Longitude Longitude VISIBLE NONE;Latitude Latitude VISIBLE NONE;Township Township VISIBLE NONE;Section_ Section_ VISIBLE NONE;T_S T_S VISIBLE NONE;created_user created_user VISIBLE NONE;created_date created_date VISIBLE NONE;last_edited_user last_edited_user VISIBLE NONE;last_edited_date last_edited_date VISIBLE NONE;GlobalID GlobalID VISIBLE NONE;CustomerID CustomerID VISIBLE NONE;StationGuid StationGuid VISIBLE NONE;phaseCode phaseCode VISIBLE NONE;ADD3 ADD3 VISIBLE NONE;ACCTBASE ACCTBASE VISIBLE NONE;PrimaryPhone PrimaryPhone VISIBLE NONE;BusinessPhone BusinessPhone VISIBLE NONE;MobilePhone MobilePhone VISIBLE NONE;CustID CustID VISIBLE NONE;CoopNumb CoopNumb VISIBLE NONE;CISPhase CISPhase VISIBLE NONE;MEDALERT MEDALERT VISIBLE NONE;SHAPE SHAPE VISIBLE NONE")

sourceFC="E:\\arcgisserver\\Resources\\ScriptingDB.gdb\\BISRV_BIPERCON"


print "Start script: " + strftime("%Y-%m-%d %H:%M:%S")


sourceFieldsList = ["BI_SRV_MAP_LOC","SRVADDR1","SRVADDR2","SRVST","SRVCITY","SRVZIP","BUSINESS","HOME","BICUST","MOBILE","BI_FORMAT_NAME","BI_FORMAT_ADDL_NAME","BI_KEY_CUST_CD","BI_ROUTE_CD","BI_OUT_PRI_CD","BI_MED_NEC_CD","BI_SRV_LOC","BI_AR_STAT","BI_ACCT", "BI_MTR_PHS", "BI_TRF_NBR", "BI_MTR_NBR"]

edit = arcpy.da.Editor(workspace)

edit.startEditing(False, True)
edit.startOperation()

valueDict = {r[0]:(r[1:])for r in arcpy.da.SearchCursor(sourceFC, sourceFieldsList)}

updateFC = "E:\\arcgisserver\\Resources\\Connections\\GIS_SCRIPTINGVS.sde\\GISDBO.ELECTRIC\\.DBO.ServicePoint"

updateFieldsList = ["SERVICE_MAP_LOCATION","SERVICE_ADDRESS","SERVICE_ADDRESS2","STATE","CITY","Zip","BusinessPhone","PrimaryPhone","CustID","MobilePhone","NAME","MISC_CHAR_1","KeyAccount","MISC_CHAR_2","OutagePriority","MEDALERT","Service_Number","SERVICE_STATUS","Account_Number", "CISPhase","TRANSFORMER_ID","MeterNumber"]


changeCnt = 0


with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
   for updateRow in updateRows:

         keyValue = updateRow[0]

# verify that the keyValue is in the Dictionary
          if keyValue in valueDict:
# compare record in the dictionary to the existing update table record
                  if list(valueDict[keyValue]) <> updateRow[1:22]:
# A change has been made to the mailing address, owner names or both
                  changeCnt += 1
                  for n in range (1,len(sourceFieldsList)):
                        updateRow = valueDict[keyValue][n-1]
                   updateRows.updateRow(updateRow)
                    print changeCnt
# Stop the edit session and save the changes
## edit.stopEditing(True)
## edit.startEditing(False, True)
## edit.startOperation()

del valueDict
#arcpy.DeleteManagement("test")
edit.stopOperation()
print "edits complete"
# Stop the edit session and save the changes
edit.stopEditing(True)
arcpy.ClearWorkspaceCache_management()
print "Finished script: " + strftime("%Y-%m-%d %H:%M:%S")
print "Finished script: " + strftime("%Y-%m-%d %H:%M:%S")

RichardFairhurst
MVP Alum

By my count you have a list of 21 fields, so I am surprised you are not getting an indexing error when you try to compare updateRow[1:22].  Indexes are 0 based and the top index of the pair should be 21, not 22, if my count is correct.  It should be updateRow[1:21] (you are skipping index 0 and therefore only have 20 items in the list, and since the final index has to be 1 greater that the last actual index, which is 20, it should be 21).

Your tabbing also is hard to read.  You should use the Syntax highlighter.  Take note of the tabbing in the code below and make sure each tab level matches my code.

In order for a compared record to be bypassed it must be an exact duplicate of the record in the other data source.  Are you sure that all of the field types being compared are the same and that the fields in the two lists are perfectly matched up?  For example, if one data source has a numeric field, but the field you are matching in other data source is a string, the record values would never match.  You would have to make the value type of every field identical to the value type field you are matching before you could compare the records.

Perhaps the easiest way to compare the records is to print them when the changeCnt is 0.  For debug purposes I would rewrite the code to the following and compare the two records for differences, misalignment of comparison values, etc.  Preferably you could also add a query to the update cursor declaration that would only return a record that you expected to have an exact match in the other data source.

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
   for updateRow in updateRows:
       keyValue = updateRow[0]
       # verify that the keyValue is in the Dictionary
       if keyValue in valueDict:
           # compare record in the dictionary to the existing update table record
           if list(valueDict[keyValue]) <> updateRow[1:21]:
# Comment out this if clause once you account for differences in the records.                
               if changeCnt == 0:
                   print(list(valueDict[keyValue]))
                   print(updateRow[1:21])
                   break
               # A change has been made to the mailing address, owner names or both
               changeCnt += 1
# temporarily comment out updates until you can compare the records
#               for n in range (1,len(sourceFieldsList)):
#                   updateRow = valueDict[keyValue][n-1]
#                   updateRows.updateRow(updateRow)
#                   print changeCnt‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

Alternatively, you could rewrite the code to just compare a single field to see how many records update and then gradually increase the number of fields you are comparing until you discover the field that is mismatched or has values that differ in the two sources for every row.  It you found a consistent difference (like " " vs "") you could standardize the gdb version to have the same format as the sde version.

The structure of the code seems sound, so the most likely causes of the kind of bug you are experiencing are in the specific implementation for the data you are processing, like misaligned field lists, field type differences in the data sources, or value formatting differences in the data sources.

HillaryBjorstrom
Frequent Contributor

Thank you so much. I didn't realize the field types would matter. that is probably the issue.

I did run the code as it was and it didn't fail so i think the edits are still less than before. 

I will try the single field at a time. Thank you so much for your help on this. I refer to this blog a ton and really appreciate the work you put into it and also the quick response with help on this. Thanks again!!

ChaimSchwartz3
Occasional Contributor

Hi Richard,

Thank you very much for this great post. Just wanted to let you know that using your code on a versioned .sde I was able to reduce the time for my workflow from 8:56 minutes (using model builder and field calculator) to 3:13 minutes, which is a great improvement, updating 19 fields for ~1000 records within a +500,000 records database.

Thanks!

Chaim

ChaimSchwartz3
Occasional Contributor

Thanks Hillary  - your questions and Richard's answers were helpful to me as well. Since you were mentioning your program crashed, and since I believe adding the "if" check will only be able to improve your speed to some extent, I was wondering if you are running the code within an ArcMap session? When running Richard's code on a +500,000 records versioned .sde on my not-too-sophisticated computer at work it will also crash at times, but running it as a standalone script (Through IDLE) works.

RichardFairhurst
MVP Alum

Chaim:

Thanks for letting me know how this post helped you.  When you start loading a large number or records and fields to a dictionary, python can run out of memory.  I would expect running the code in an ArcMap session would be more likely to trigger the error than running it in Idle.  Having a lot of other applications open can sometimes contribute to the memory issue.  I also have found that if you have code that needs to load several large dictionaries that Python only releases the memory of a dictionary you no longer need if you loaded and processed the dictionary within a def method and not in the main program.  Once the def method completes and returns control to the main program the dictionary memory loaded by the method is released.

HillaryBjorstrom
Frequent Contributor

No, i run it in python within a edit session. The if check is working great for me and eliminated the crashing. Are you setting your edit session as    

edit = arcpy.da.Editor(arcpy.env.workspace)
edit.startEditing(False, True)
edit.startOperation()

I had True, True at first and it wasn't working right. The False, True worked better on my versioned database.

The one thing i had to do was export the schema i was wanting to update into a temp database then load the data i wanted to compare into it. so the fields i was comparing were exactly the same type, size, etc.

ChaimSchwartz3
Occasional Contributor

Thanks! for me running it in or outside of ArcMap definitely made a change.

HillaryBjorstrom
Frequent Contributor

Hi I know this is an old post. but i was wondering the code above 

                for n in range (1,len(sourceFieldsList)):  
                    updateRow[n] = valueDict[keyValue][n-1]
                updateRows.updateRow(updateRow) 

shows the update.Rows.updateRow(updateRow) as dedented region. but the code in the previous shows it indented. Does this matter in the for loop? will it commit each update. Thanks!

RichardFairhurst
MVP Alum

The updateRows method should occur in the dedented position that is outside of the for loop.  The for loop is iterating through the fields of a row and assigning each field with the corresponding item in a list which is stored as the value of the shared dictionary key.   The row should only be updated once after all of the field values of the row have been have been transferred to the row from the dictionary.

KendraHopper2
Emerging Contributor

Hello!

There's no question that this is my #1 most visited post. It's provided a wealth of information and feedback that has enabled me to automate an essential business process. So, a big thank you to all who have constructively contributed!

I'm wondering if anyone would be willing to peer-review the python script that I have put together. As with many GIS shops, I'm the only one with the esri/python knowledge, and it would be fantastic to have a second set of eyes to look for efficiencies (and yes, I'll hesitate to admit, potential errors...). The script is running like a charm, updating, adding and deleting records of a related table within a geometric network. 

Unfortunately, I have recently noticed that there could be some corruption to the related table indexing, or there is something inherently broken with the "Related Table" tool. I have a support ticket open with esri, but they do not support scripting. 

What is the best platform for peer-review of scripting? I don't want to dilute this post since it is by far the most concentrated knowledge for the use of these cursors that I am aware of. Do you have any suggestions as to how/where I could get a second option on my code?

Thanks in advance,

Jen

HillaryBjorstrom
Frequent Contributor

Thank you so much!

RichardFairhurst
MVP Alum

First of all, thank you for letting me know that this code and the discussions it has inspired have been extremely useful for streamlining your business processes.  Since I have created this Blog, I have applied the concepts introduced here repeatedly myself.  I want to keep this Blog focused on the core concepts of this technique that deal with the most common problems everyone can understand, but I have found numerous other ways to expand and adapt them to solve much more complex and unique problems that someday may justify the creation of a separate advanced level Blog Post.

As far as your code, you should create a new forum post under the Python/Geoprocessing forum and include the code you want to have reviewed.   Your post in that forum can be as unique to your needs as you want.  I monitor that forum group, and if you mention my name or my blog post in your post I will certainly take a look at your code when I have time.  Also, a lot of other Python users contribute to that forum that know more than I do about other aspects of Python and its optional modules who may offer alternative approaches or methods for solving your particular problem.  Good luck.

KendraHopper2
Emerging Contributor

Of course! Great advice. Thank you!

NasrullahKhan
Emerging Contributor

Hi Richard, I came across this post while searching for a solution to my case listed in post "https://community.esri.com/thread/242367-need-help-on-python-script ". the code basically works but take longtime to process. I am new to python and need some guidance to improve my code, if possible could you please spare sometime to have a look on my post and through some light on it, thanks.

JeffWard
Honored Contributor

I'm thinking the transition to the new forum platform has messed with your code samples - it looks like newlines and white space didn't come across very well.

DrewDowling
Frequent Contributor

@RichardFairhurstI just want to say thank you so much for this post. I've been using it a reference for years.

EricMahaffey1
Frequent Contributor

@RichardFairhurstI came across your blog while looking for a way to extract a couple of field values from a single field, and append them together (in a string) in another table/field within a single cell.  From what I've been reading I need to use a dictionary to store the values from the SearchCursor, then UpdateCursor to populate the cell with the dictionary values.  Does this sound correct?  In my mind it seems so straightforward, but getting it to actually work is a different story.  Any thoughts?

RichardFairhurst
MVP Alum

This blog provides a method of data transfer which in its simplest form loads a dictionary from a source and then uses an updateCursor on the target to transfer values where each dictionary key value in the source has an identical matching key value in your target. 

It sounds like you are wanting to do a slightly more advanced version of transfer where you are summarizing values from the source and transferring them to the target where a many to one relationship exists between the source and the target on the key field.  This can be done, and the closest example I provide in the blog to your use case is covered in the Using a Python Dictionary Built using a da SearchCursor to Replace a Summary Statistics Output Table.  However, the example needs substantial modification to achieve the summary equivalent of a Join merge rule that outputs a delimited list of values when multiple features in the source relate to a single feature in the target based on a common attribute key value.

I would need to know more about your specific needs to be sure I have clear understanding of what you want to achieve before I could provide you with sample code that fits your scenario.

ChaseSchieffer
Emerging Contributor

Hi Richard,

Thank you for your in-depth blog on cursors. I've used them in a number of different projects for updating feature classes where the data is one-to-one.

I'm wondering if you know of a way to use cursors with related tables. For example, I'm trying to figure out how to search through a one-to-many related table, find the latest related record, and update a field in the related feature class.

I have a feature class with a key field called FacilityID, and I have a related table with a key field called HydrantID. One hydrant can have many records in the related table. I would like to search through the related records and find the latest one - I have editor tracking enabled, so I'd probably be able to use the created date or last edited date. In the related table is a status field, and I'd like to find the latest related record, take that status field value, and update a field in the feature class called Latest Status with the status from the related record.

Is something like this possible?

DonMorrison1
Frequent Contributor

Chase - this should be very straight forward (assuming I understand correctly. You can create an update cursor on your feature class and loop through every feature. For every feature you grab the Facility ID then create a search cursor on the related table with a WHERE clause the selects only those related records associated with the current Facility ID - you can even specify a sort order on the  search cursor so the related records come back in descending date-order. Take the status field from the first record returned on the search cursor, copy it to the feature record and call updateRow.

BlakeTerhune
MVP Frequent Contributor

@ChaseSchieffer, using the methods outlined by Richard in this blog article, you would first need to read all the related data into a dictionary. I can't figure out how you would do it with max() on the date field using a SearchCursor so you could use ArcSDESQLExecute() if your data is in an rdbms. Otherwise, you can just manipulate the data with Python in a SearchCursor. That would look something like this.

 

# Build dictionary with only the most recent related records.
latest_related_data = {}
with arcpy.da.SearchCursor(related_table, ["rel_id", "rel_date", "other_rel_field", "etc"]) as s_cursor:
    for rel_id, rel_date, other_rel_field, etc in s_cursor:
        related_record = latest_related_data.get(rel_id)
        # Only update the entry if it's a new id or the date is newer than what's there.
        if not related_record or rel_date > related_record["rel_date"]:
            latest_related_data[rel_id] = {
                "rel_date": rel_date,
                "other_rel_field": other_rel_field,
                "etc": etc
            }

# Update the hydrant feature class with something from the latest related record.
with arcpy.da.UpdateCursor(hydrant_fc, ["id", "other_field"]) as u_cursor:
    for id, other_field in u_cursor:
        # Find the related record for this hydrant id.
        # This will error if the hydrant id doesn't have a related record.
        # Maybe use dict.get() instead?
        latest_related_record = latest_related_data[id]
        # Do something with the data.
        # You can get data with individual field names from the related table.
        print(f"The latest related data for Hydrant ID {id} ({other_field})is from {latest_related_record['task_date']}")
        print(f"\tother_rel_field{latest_related_record['other_rel_field']}")
        print(f"\tetc{latest_related_record['etc']}")

 

RichardFairhurst
MVP Alum

@ChaseSchieffer, The last example in my blog covering using a dictionary to replace a summary table applies to your situation, with a little adaptation to your specific needs.  If your main feature class had a LatestStatusDate field you could also update that to go along with your LatestStatus value for the record.  

November 12, 2021:  I enhanced the code performance by adding an if condition that only updates the master table record when the related table date in the dictionary is the more recent than the one in the LatestStatusDate field or the status value of the dictionary is different from the one held in the LatestStatus field.  This gives a significant performance boost if you have two large sets of records to process and match, but in reality only a few records actually need to be updated due to value changes.

 

 

from time import strftime

print( "Start script: " + strftime("%Y-%m-%d %H:%M:%S"))

import arcpy

sourceFC = r"C:\Path\RelateTable"

sourceFieldsList = ["HydrantID", "CreateDate", "LastEditDate", "Status"]

# Build a summary dictionary from a da SearchCursor with unique key values
# of a field storing a list of the latest date and status.
valueDict = {}
with arcpy.da.SearchCursor(sourceFC, sourceFieldsList) as searchRows:
    for searchRow in searchRows:
        keyValue = searchRow[0]
        if not keyValue in valueDict:
            # assign a new keyValue entry to the dictionary with the 
            #latest date of the Created or Last Edit Date field.
            lastDate = searchRow[1]
            if searchRow[2] > lastDate:
                lastDate = searchRow[2]
            valueDict[keyValue] = [lastDate, searchRow[3]]
        else:
            # update the date and status of an existing keyValue entry in
            # the dictionary if the current record's 
            # Created or Last Edit Date is the Latest date.
            newDate = searchRow[1]
            if newDate < searchRow[2]:
                newDate = searchRow[2]
            if valueDict[keyValue][0] < newDate:
                valueDict[keyValue] = [newDate, searchRow[3]]  


updateFC = r"C:\Path\UpdateFeatureClass"

updateFieldsList = ["FacilityID", "LatestStatusDate", "LatestStatus"]

with arcpy.da.UpdateCursor(updateFC, updateFieldsList) as updateRows:
    for updateRow in updateRows:
        # store the Join value of the row being updated in a keyValue variable
        keyValue = updateRow[0]
        # verify that the keyValue is in the Dictionary
        if keyValue in valueDict:
            if updateRow[1] < valueDict[keyValue][0] or updateRow[2] != valueDict[keyValue][1]:
                # Transfer the data
                updateRow[1] = valueDict[keyValue][0] # Latest Status Date
                updateRow[2] = valueDict[keyValue][1] # Latest Status
                updateRows.updateRow(updateRow)


del valueDict

print( "Finished script: " + strftime("%Y-%m-%d %H:%M:%S"))

 

 

DonMorrison1
Frequent Contributor

I'm a big believer in using python dicts to optimizing the use of cursors, especially when the access patterns are more random than sequential. However in Chase's case it seems that using the python dict makes it more complicated than it needs to be.  Unless there is a big performance advantage using the dicts (and there might very well be),  this is how I would do it.

 

import arcpy

relTable = r"C:\Path\RelateTable"
updateFC = r"C:\Path\UpdateFeatureClass"

with arcpy.da.UpdateCursor(updateFC, ["FacilityID", "LatestStatus"]) as updateRows:
    for f_id, old_status in updateRows:        
        with arcpy.da.SearchCursor(relTable, 
                                   ["Status"], 
                                   where_clause="FacilityIDRel=%s" % f_id,
                                   sql_clause=(None, "ORDER BY last_edited_date DESC")
                                   ) as searchRows:
            for (new_status,) in searchRows:
                updateRows.updateRow(f_id, new_status)
                break

 

 

RichardFairhurst
MVP Alum

Using Embedded cursors is not a good idea for any data that will continue to grow into more than 1,000 records that match.  Each cursor reopens the table on disk and processes all records in the table during each loop to isolate the records specified by the query filter, so the Embedded cursors result in an exponential performance hit as the record sets grows.  Using an attribute index on all filter fields gives a performance boost, but that boost gets overwhelmed when the number of unique values it needs to filter grows beyond a certain point.

Dictionaries give a huge performance enhancement over query filters when there are a huge number of unique values to access.  Dictionaries are random access with no real performance delay for accessing the 1st record or the 1 millionth record.  The Maximum time it has taken me to load 40 fields from 1 million records into a dictionary (including the shape field) is 10 minutes and that allows me to match them to 1 million records in an update table in a single pass.  While I haven't done this in my blog, I frequently gain another performance enhancement by adding a python expression that compares the 40 field values so I only write updates to the actual records that changed (typically 1,000 records or less).  With that enhancement I can complete the updates in under another 10 minutes.  The comparison expression is up to 20 times faster than a cursor writing to a record that isn't actually changing, especially in large tables.  The base code in my blog and the code I provided in my latest response can be revised so that it dramatically speeds up again if your record set is very large and the number of records actually getting updated is small by only writing to records that actually changed.  It only takes a revision of one or two lines of code so I will go back and add that.

ChaseSchieffer
Emerging Contributor

@RichardFairhurst, Thank you for the code example. I was able to successfully implement it on a set of test data. The only issue I had was Null Values in the Latest_Status_Date in the Hydrant feature class. I created this field before running the script, so all of them are initially set to Null. When I ran the script I got an error referencing line 45: TypeError: can't compare datetime.datetime to NoneType.

Initially, my solution was to simply field calculate a date of  1/1/2000 in the Latest_Status_Date since I knew that there wouldn't be any inspection records before then, and this would give the script a date to compare to.

However, I wound up adding an additional OR operation to line 45. So it now looks like:

if updateRow[1] is None or updateRow[1] < valueDict[keyValue][0] or updateRow[2] != valueDict[keyValue][1]:

 As I understand it, this means that if the Latest_Status_Date is Null, then transfer the data from the dictionary anyway if there is any.

The use case for this script is essentially to be able to symbolize the latest data from a related table in Field Maps. For example, if the field crew creates a new inspection of a hydrant, and the hydrant needs additional work, I can now create a symbol for that. The script will run every half hour or so to update the hydrant feature class with the latest related data, and I will have a feature service set up to add a ring symbol around any hydrant that has a status of Needs Work. This way the field crew can see visually in Field Maps what Hydrants they need to go back to for additional work in the future without having to click on hydrants and viewing their inspections to see which ones need additional work.

Thanks again,

Chase

LydiaRill
New Contributor

This has changed my processing time from 4 hours (using the Join tool in arcpy) to 5 minutes. This is a great substitute for joining a CSV to an attribute table. Thank you! I wish I saw this blog years ago!

Labels