CalculateFields (multiple) vs. da.UpdateCursor

06-08-2020 05:04 AM
JakeJohnson
New Contributor III

Hello,

I've been trying to improve the scripts I have and stumbled across this legendary post:

/blogs/richard_fairhurst/2014/11/08/turbo-charging-data-manipulation-with-python-cursors-and-diction... 

However, I was wondering if CalculateFields, as opposed to CalculateField, would be even better. I don't know how it works behind the scenes -- is it just multiple CalculateField calls, or something more than that?

JoeBorgione
MVP Emeritus

This Riker Maneuver?  Riker sits down - YouTube   

LOL...

That should just about do it....
JoshuaBixby
MVP Esteemed Contributor

There are several factors that can influence the speed at which Calculate Field(s) and cursors update data sets.  To eliminate disk caching and disk I/O factors, I did some testing with multi-million record data sets using memory workspaces.  I found that Calculate Fields ranged from 25-66% faster than using ArcPy DA update cursor.  Since I was testing using memory workspaces, all of the times were < 20 seconds, so I don't know if saving 3 or 5 seconds is really that big of a deal.  It would be interesting to see if Calculate Fields continues to outperform update cursors when using data in file geodatabases or enterprise geodatabases.
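For reference, here is a minimal sketch of the kind of timing comparison described above; the paths and field names are made up, and the Calculate Fields field list mirrors the string syntax shown later in this thread:

import time
import arcpy

# Copy a large feature class into the memory workspace so disk I/O does not
# dominate the timings (hypothetical source path and field names).
src = r"C:\data\test.gdb\parcels"
mem_fc = arcpy.management.CopyFeatures(src, r"memory\parcels")[0]

# Time the Calculate Fields (multiple) tool writing three fields in one call.
start = time.time()
arcpy.management.CalculateFields(mem_fc, "PYTHON3",
                                 "FieldA 1.0;FieldB 2.0;FieldC 3.0")
print("CalculateFields:", time.time() - start)

# Time an arcpy.da.UpdateCursor pass writing the same three fields.
start = time.time()
with arcpy.da.UpdateCursor(mem_fc, ["FieldA", "FieldB", "FieldC"]) as cursor:
    for row in cursor:
        cursor.updateRow([1.0, 2.0, 3.0])
print("UpdateCursor:", time.time() - start)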

When updating data as part of a larger Python script, I will continue to use cursors because I find the structure of the code more logical and easier to read/maintain.  I can't imagine trying to pass a multi-line code block and multiple fields to Calculate Fields.  It might be technically doable, but it will be ugly as sin. 
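As a rough illustration of that readability contrast (hypothetical field names; the exact Calculate Fields value-table syntax used here is an assumption):

import arcpy

fc = r"C:\data\test.gdb\parcels"  # hypothetical feature class

# Calculate Fields route: the shared logic has to be packed into a code block
# string and referenced from each field's expression.
code_block = """
def classify(value):
    if value is None:
        return 'UNKNOWN'
    return 'HIGH' if value > 100 else 'LOW'
"""
arcpy.management.CalculateFields(
    fc, "PYTHON3",
    [["ClassA", "classify(!ValueA!)"], ["ClassB", "classify(!ValueB!)"]],
    code_block)

# Cursor route: ordinary Python, one pass over the rows.
def classify(value):
    if value is None:
        return 'UNKNOWN'
    return 'HIGH' if value > 100 else 'LOW'

with arcpy.da.UpdateCursor(fc, ["ValueA", "ValueB", "ClassA", "ClassB"]) as cursor:
    for value_a, value_b, _, _ in cursor:
        cursor.updateRow([value_a, value_b, classify(value_a), classify(value_b)])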

JoeBorgione
MVP Emeritus

I can't imagine trying to pass a multi-line code block and multiple fields to Calculate Fields.  It might be technically doable, but it will be ugly as sin. 

^ This....

That should just about do it....
JakeJohnson
New Contributor III

Thanks for the test! That really cleared up my questions. It's probably not worth rewriting, but there is definitely value in keeping your code Pythonic. Still, my curiosity will get the better of me, since all of my work is in FGDBs or enterprise/SDE geodatabases.

JoshuaBixby
MVP Esteemed Contributor

For people who only use the GUI or ModelBuilder, Calculate Fields is a nice improvement.  For those of us who like to script, I think the added value is quite small.

RichardFairhurst
MVP Honored Contributor

The slowest approach for updating a single field is to use two nested cursors, since the numbers of records read from the two tables are multiplied by each other (100K x 100K = 10,000,000,000 record reads).  The join and Calculate Field approach is easy to set up, and it apparently does use memory to optimize the calculation when the join fields in both the source and the target have an attribute index, so I use it for one-time calculations when I am willing to wait about 5 to 10 times longer than the Turbo Charging approach takes.  If either table does not or cannot have an attribute index on the join field, the join and Calculate Field approach is not worth doing.  Any time I am scripting a process that will have to run repeatedly, it is always worth my time to set up and use the Turbo Charging approach, which reads each record of the two tables only once (100K + 100K = 200K record reads) and is the fastest at loading the source table into memory and updating the target records.  Attribute indexes on the join fields have little to no effect on the Turbo Charging approach, but attribute indexes on the fields being updated can slow the transfer for both approaches, so I try to add those indexes after the data transfer is complete.
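For readers who have not followed the linked blog post, here is a condensed sketch of the dictionary and da.UpdateCursor pattern being discussed; the paths and field names are placeholders, and the details differ from the original post:

import arcpy

source = r"C:\data\test.gdb\source_table"   # hypothetical paths
target = r"C:\data\test.gdb\target_table"
join_field = "PARCEL_ID"                    # hypothetical common key
value_fields = ["OWNER", "ACRES"]           # hypothetical fields to transfer

# One pass over the source table: load the transfer values into a dictionary
# keyed on the join field.
value_dict = {row[0]: row[1:] for row in
              arcpy.da.SearchCursor(source, [join_field] + value_fields)}

# One pass over the target table: look up each record's key and write the values.
with arcpy.da.UpdateCursor(target, [join_field] + value_fields) as cursor:
    for row in cursor:
        key = row[0]
        if key in value_dict:
            cursor.updateRow([key] + list(value_dict[key]))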

Another optimization of my Turbo Charging approach that I use a lot is to add an if condition that compares the update values in the dictionary to the current values in the target and only performs updates on the records in the target table that have changed, which pays off especially when I know that only a small number of records have actually changed since the last update.  Doing the comparisons in memory and touching only the changed records is much faster than writing values to every record of the target table when relatively few values have actually changed.
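Continuing the sketch above (same value_dict, target, join_field, and value_fields), the change-only variant amounts to comparing before writing:

import arcpy

# Only write rows whose stored values differ from the dictionary values.
with arcpy.da.UpdateCursor(target, [join_field] + value_fields) as cursor:
    for row in cursor:
        new_values = value_dict.get(row[0])
        if new_values is not None and tuple(row[1:]) != tuple(new_values):
            cursor.updateRow([row[0]] + list(new_values))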

JonathanNeal
Esri Contributor

Hi all, there are a few items I can comment on here.

Most importantly, Calculate Field(s) should be faster than update cursors for a single field, as it never needs to convert C++ data types to the equivalent Python ones and back again.

Also note that each time a tool connects to a data source, it opens the data source and closes it.  So if you are updating a lot of fields, you are slowed down by the cost of opening and closing the data each time you run the Calculate Field tool.  That is why we made the Calculate Fields tool: it reduces the number of open/close cycles to one and lets you update multiple fields in a single pass, faster than a Python cursor ever could.
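In other words, the difference is roughly the following (hypothetical field names; the semicolon-delimited field list mirrors the syntax used later in this thread):

import arcpy

fc = r"C:\data\test.gdb\parcels"  # hypothetical feature class

# Three Calculate Field calls: the data source is opened and closed three times.
arcpy.management.CalculateField(fc, "FieldA", "1.0", "PYTHON3")
arcpy.management.CalculateField(fc, "FieldB", "2.0", "PYTHON3")
arcpy.management.CalculateField(fc, "FieldC", "3.0", "PYTHON3")

# One Calculate Fields call: the data source is opened and closed once.
arcpy.management.CalculateFields(fc, "PYTHON3",
                                 "FieldA 1.0;FieldB 2.0;FieldC 3.0")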

RichardFairhurst
MVP Honored Contributor

While I do not doubt that the Calculate Fields tool will update multiple fields in a single table faster than a da cursor for the reasons you state, nothing you have said gives me confidence that the Calculate Fields tool used on two joined tables outperforms a dictionary and a da cursor.  Consider a join between two 1-million-record tables with attribute indexes on both join fields.  In the time the Field Calculator for a single field is still preparing to write its first value to the target table (the green progress bar does not even start showing any progress for 20 to 40 minutes), the dictionary has read all of the records and multiple field values from the 1 million source records, the dictionary matches the current da.UpdateCursor record on the join field nearly instantaneously, and the da cursor has written all of the field values to the target table, even if no attribute index exists on either table.  Is there any improvement in how the join is processed by the Calculate Fields tool?  If not, the dictionary and cursor is the way to go. 

The dictionary and da.UpdateCursor technique also lets me read from multiple source tables, perform all of their independent joins/matches, and write all of their values to the target table in a single UpdateCursor pass.  I believe that in that situation the advantage you have mentioned is taken away.  With a minor change to the code, records in the target table can first be compared to the dictionary record so that updates are limited to only those records that have actually changed, which can dramatically speed up performance when only a few hundred or a few thousand records out of millions actually need updates.  Parallel dictionaries can be used to gather records that need to be deleted from or inserted into the target table so that it will match the source table, and those operations can be performed immediately after the UpdateCursor completes.  The dictionary can also act as a Summarize tool, transferring averages, sums, etc. without running that tool prior to the data transfer.  I use all of the above options in my scripts, and more.  For those who are comfortable with advanced Python programming, the sheer number of variations the dictionary and da cursor technique can be adapted and optimized to handle should make it at least competitive with, if not preferable to, any set of chained geoprocessing tools that can accomplish what it can do.
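A rough sketch of the multi-source, changed-records-only variant described above (all names are hypothetical):

import arcpy

target = r"C:\data\test.gdb\target_table"  # hypothetical target

# Build one dictionary per source table, each keyed on its own join field.
owners = {row[0]: row[1] for row in arcpy.da.SearchCursor(
    r"C:\data\test.gdb\owners", ["PARCEL_ID", "OWNER"])}
zoning = {row[0]: row[1] for row in arcpy.da.SearchCursor(
    r"C:\data\test.gdb\zoning", ["ZONE_ID", "ZONE_DESC"])}

# One UpdateCursor pass over the target applies both independent joins and
# writes only the rows whose values actually change.
fields = ["PARCEL_ID", "ZONE_ID", "OWNER", "ZONE_DESC"]
with arcpy.da.UpdateCursor(target, fields) as cursor:
    for parcel_id, zone_id, owner, zone_desc in cursor:
        new_owner = owners.get(parcel_id, owner)
        new_desc = zoning.get(zone_id, zone_desc)
        if (new_owner, new_desc) != (owner, zone_desc):
            cursor.updateRow([parcel_id, zone_id, new_owner, new_desc])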

Also, where was this tool 6 years ago (and in the many years before that, when I needed a solution to this problem but hadn't come across one), when my only option was to learn about dictionaries and write all of my data transfer scripts myself?  The 6 years of fantastic performance gains the dictionary and da.UpdateCursor technique gave me have changed my entire career, and it was the only practical way for me to manage my large feature class and table data transfers during that time.  For someone who has never written a script for data transfers like this, the Calculate Fields tool might be their first choice, but how does it hold up for a user like me who would have to break and rewrite dozens of very large and complex scripts to incorporate the Calculate Fields tool as a dictionary and da cursor replacement?  To make it worth my while to rewrite those scripts, you have to give me a much stronger reason to believe that this new tool is a better solution, one that will save me even more time in the scripts that handle large data transfers among two or more tables going forward.

JonathanNeal
Esri Contributor

I tried the following comparison and got close results.  I also tried the join case and an additional search cursor, and the results were the same.

arcpy.management.CalculateFields("oneMil_SpatialJoin", "PYTHON3", "SomeDem1 1.0;SomeDem2 1.0;SomeDem3 1.0;")
<Result 'oneMil_SpatialJoin'>
52 seconds

import time
start = time.time()
fc2 = 'oneMil_SpatialJoin'
fields = ['TRACT', 'SomeDem1', 'SomeDem2', 'SomeDem3']
with arcpy.da.UpdateCursor(fc2, fields) as cursor:
    for row in cursor:
        temp = 1.0
        row[1] = temp
        row[2] = temp
        row[3] = temp
        cursor.updateRow(row)
print(time.time()-start)
54.09208965301514
RichardFairhurst
MVP Honored Contributor

I don't see anything in the code you showed that indicates you had two separate tables with a million records each, joined through a common attribute, that transferred data from the join table to the target table, and that you compared the time that took to the time it takes to load the source table records into a dictionary with a SearchCursor and then transfer them to the target table based on a common join field, per the code in my /blogs/richard_fairhurst/2014/11/08/turbo-charging-data-manipulation-with-python-cursors-and-diction....  My issue has nothing to do with processing updates between different fields within the output of a Spatial Join involving a million records, which is what your code appears to be doing.  I am convinced that the CalculateFields tool is better than a da.UpdateCursor when calculations need to occur on multiple fields within a single table, or for successive single-field calculations, but that is a minor issue compared to the time it takes to transfer data between two separate large tables to synchronize a target table with the values in a source table on multiple fields.

Also, I find that frequently the target table cannot be duplicated in memory.  At least in my case, I often need to transfer data from a working dataset to another dataset that cannot be deleted or replaced, only updated.  Since writing to data stored on disk appears to add substantial time to the process, that only makes finding the best-performing approach for this kind of work even more critical.