Optimize performance with arcpy UpdateCursor on large polygon Feature Class and Table

07-20-2022 11:41 AM
AndrewVitale
New Contributor III

I'm attempting to update a numeric field using an arcpy.da.UpdateCursor. I'm applying the update to a polygon feature class and a table, which participate in a one-to-many relationship class. The feature class is quite large, at 1,219,495 polygon records, and the table is even larger, at 4,735,679 rows.

The update should only be applied if certain conditions are met: if the child table's Provider field has a value of "Optimum", the row needs to be updated, and if the parent feature's MaxDownload field is 1,000, it needs to be updated. I'm running the queries first via SearchCursor and saving the relevant IDs that satisfy them. Then, in an edit session, I'm attempting to update the relevant fields using a where_clause to select the OIDs I need.

import datetime as dt

import arcpy

def log_it(msg):
    print(f"[{dt.datetime.now()}]: {msg}")
    return


def query_str_in_list(field_name, values):
    vals = "'" + "', '".join(values) + "'"
    return f"{field_name} IN ({vals})"


def query_int_in_list(field_name, values):
    vals = ", ".join([str(v) for v in values])
    return f"{field_name} IN ({vals})"


def main():
    start_time = dt.datetime.now()
    print(f"Starting script at {start_time}")

    arcpy.env.workspace = r"C:\dev\arcgis\NYS-DPS\NysDpsBroadbandMap\2022-06-01_Analysis\2022-06-01_Analysis\NysDpsBroadbandMap.gdb"

    global_ids = []
    provider_oids = []
    with arcpy.da.SearchCursor(
        "StandardProviders",
        ["ParentGlobalID", "OBJECTID"],
        where_clause="Provider = 'Optimum'",
    ) as search_cursor:
        for row in search_cursor:
            global_ids.append(row[0])
            provider_oids.append(row[1])

    polygon_oids = [
        row[0]
        for row in arcpy.da.SearchCursor(
            "StandardSummaryPolygons",
            ["OBJECTID"],
            where_clause=query_str_in_list("GlobalID", global_ids)
            + " AND MaxDownload = 1000",
        )
    ]

    log_it(
        f"Beginning to edit {len(provider_oids)} Provider Records and {len(polygon_oids)} Polygon Records"
    )

    with arcpy.da.Editor(arcpy.env.workspace) as edit_session:
        with arcpy.da.UpdateCursor(
            "StandardSummaryPolygons",
            ["OID@", "MaxDownload"],
            where_clause=query_int_in_list("OBJECTID", polygon_oids),
        ) as polygon_update_cursor:
            for idx, row in enumerate(polygon_update_cursor):
                log_it(f"{idx} Polygon Iterations")
                oid = row[0]
                if oid in polygon_oids:
                    polygon_update_cursor.updateRow([oid, 940])

        with arcpy.da.UpdateCursor(
            "StandardProviders",
            ["OID@", "Download"],
            where_clause=query_int_in_list("OBJECTID", provider_oids),
        ) as provider_update_cursor:
            for idx, row in enumerate(provider_update_cursor):
                log_it(f"{idx} Provider Iterations")
                oid = row[0]
                if oid in provider_oids:
                    provider_update_cursor.updateRow([oid, 940])

    end_time = dt.datetime.now()
    duration = end_time - start_time
    print(f"Finished at {end_time} after executing for {duration}")
    return


if __name__ == "__main__":
    main()

The script is running quite slowly, and it appears that the polygon feature class is the root cause. When iterating through that cursor, each iteration is taking 2 seconds. Iterating through the table gives the performance I would expect.

Here's a log snippet:

## Parent Table Log (Polygon Features)

Starting script at 2022-07-20 14:13:51.937132
[2022-07-20 14:13:59.535107]: Beginning to edit 276149 Provider Records and 212815 Polygon Records
[2022-07-20 14:13:59.801700]: 0 Polygon Iterations
[2022-07-20 14:14:01.015486]: 1 Polygon Iterations
[2022-07-20 14:14:02.363565]: 2 Polygon Iterations
[2022-07-20 14:14:03.780736]: 3 Polygon Iterations

## Child Table Log (Table Rows)

[2022-07-20 14:19:10.037031]: 6740 Provider Iterations
[2022-07-20 14:19:10.037031]: 6741 Provider Iterations
[2022-07-20 14:19:10.038029]: 6742 Provider Iterations
[2022-07-20 14:19:10.038029]: 6743 Provider Iterations

Am I simply running into physical limits due to the volume of the data? Are there any performance gains that could be implemented here?

Thanks!

9 Replies
by Anonymous User
Not applicable

It might be that writing to a polygon feature class simply requires more database work than writing to a plain table. Move the print statement around and see which line of code is actually taking the time.
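
For example, something along these lines (a sketch, assuming the two-field polygon cursor from your script) would show how long each updateRow call takes:

import time

def timed_updates(cursor, new_value):
    # Iterate an already-open arcpy.da.UpdateCursor opened with ["OID@", "MaxDownload"]
    # and time each updateRow call, so the log shows where the seconds per iteration go.
    for idx, row in enumerate(cursor):
        t0 = time.perf_counter()
        cursor.updateRow([row[0], new_value])
        print(f"{idx}: updateRow took {time.perf_counter() - t0:.3f} s")

# e.g. timed_updates(polygon_update_cursor, 940) in place of the existing loop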

DonMorrison1
Occasional Contributor III

I agree that you should log before and after the `if` statements at lines 60/71 (the `if oid in ...` checks) and also immediately after the cursor updates.

Another option might be to try MakeFeatureLayer/MakeTableView (assuming you can somehow select the records with a query) and then CalculateField to set those fields, roughly like the sketch below.
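
Roughly like this (untested sketch using the dataset and field names from your post; the where clauses are simplified and the related-GlobalID filter is left out):

import arcpy

# Assumes arcpy.env.workspace is already set to the file geodatabase, as in the original script.
arcpy.management.MakeFeatureLayer(
    "StandardSummaryPolygons",
    "polys_to_fix",
    where_clause="MaxDownload = 1000",  # add the GlobalID filter here as needed
)
arcpy.management.CalculateField("polys_to_fix", "MaxDownload", "940", "PYTHON3")

arcpy.management.MakeTableView(
    "StandardProviders",
    "providers_to_fix",
    where_clause="Provider = 'Optimum'",
)
arcpy.management.CalculateField("providers_to_fix", "Download", "940", "PYTHON3")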

AndrewVitale
New Contributor III

I believe this was a case of a corrupted file geodatabase.

While debugging, I usually try to copy/paste the layers into a new file geodatabase, which has helped deal with corruption in the past. That did not work here. I went back to the FGDB source and got a fresh copy of the data, and performance returned to normal.

BlakeTerhune
MVP Regular Contributor

I know you were able to find a solution, but a couple things I want to mention:

  1. If you're only working in a file geodatabase (not a versioned feature class in an enterprise geodatabase), then I don't think you need to start and stop an edit session.
  2. You open the update cursors with a where clause and then check the OIDs again inside the loop against the same lists used to build the where clause. It's redundant (see the sketch after this list).
  3. An alternative experiment to try instead of the update cursors: make a feature layer with the same where clause and run Calculate Field.
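
For point 2, a sketch (reusing query_int_in_list and polygon_oids from the original script, still inside the edit session):

# The where clause already restricts the cursor to the target OIDs, so the
# per-row `if oid in polygon_oids` test can be dropped; if you do keep such a
# check, test against set(polygon_oids) instead of the ~212k-element list.
with arcpy.da.UpdateCursor(
    "StandardSummaryPolygons",
    ["MaxDownload"],
    where_clause=query_int_in_list("OBJECTID", polygon_oids),
) as cur:
    for row in cur:
        cur.updateRow([940])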

EDIT: I just noticed @DonMorrison1 already mentioned using calculate field.

AndrewVitale
New Contributor III

Thanks for the tips!

On 1), I do need the edit session due to the Relationship Class. ArcPy raises an Exception otherwise.

On 2), agreed, it's redundant. It's an artifact from my original approach, where I was trying to use the UpdateCursor to loop through _all_ records. I doubt the Python list lookup was the issue, though.

And on 3), I wish I could get away with using the GUI. The actual migration will require some additional computations to re-summarize some affected records.

BlakeTerhune
MVP Regular Contributor

Use the GUI? You can script calculate field with Python.

PDodd
New Contributor III

The DA cursor normalizes the field data when accessing a feature class. I've seen a significant difference between the DA and legacy cursors when accessing polygon feature classes; the more complex the geometries, the worse it is.

Try a read test using the legacy cursor and see if you notice any difference. The legacy cursors retain the feature geometry in its binary state until you access it, whereas the DA cursor explodes it into a native Python structure.
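
For example, a rough read-timing comparison (a sketch; assumes the workspace is set and uses the polygon feature class name from the original post):

import time

import arcpy

fc = "StandardSummaryPolygons"  # any large polygon feature class in the workspace

# Legacy cursor: geometry stays in its binary form unless a row's shape is accessed.
t0 = time.perf_counter()
rows = arcpy.SearchCursor(fc)
legacy_count = sum(1 for _ in rows)
del rows
print(f"Legacy SearchCursor: {legacy_count} rows in {time.perf_counter() - t0:.1f} s")

# da cursor: request only the fields you need; asking for SHAPE@ would force the
# geometry to be converted into Python objects for every row.
t0 = time.perf_counter()
with arcpy.da.SearchCursor(fc, ["OID@"]) as cur:
    da_count = sum(1 for _ in cur)
print(f"da SearchCursor: {da_count} rows in {time.perf_counter() - t0:.1f} s")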

AndrewVitale
New Contributor III

This is an interesting thought! I had stumbled across this StackOverflow answer regarding why arcpy.da cursors are so much faster: https://gis.stackexchange.com/a/108809/19883

If you pop open the slides the dev links to, it does appear that arcpy.da geometry queries are relatively slow compared to the rest of the data access lib's performance. I would have assumed that the shape doesn't need to be parsed if you're not accessing it, but the perf difference between feature layers and tables definitely indicates otherwise.

I tried with the regular cursor, and it does seem to be a bit faster iterating through my dataset. However, given that I'm operating on a relatively small portion of the overall table, I think the `where_clause` on a data access cursor will be the wisest choice.

Thanks for the thoughts.

PDodd
New Contributor III

Andrew,

One thing we discovered recently is that geoprocessing tasks record their processing details to the XML metadata log inside the file geodatabase (fGDB)! See https://pro.arcgis.com/en/pro-app/latest/arcpy/functions/setlogmetadata.htm

This can get very heavy if you retain the same fGDB and process it over and over through time. The table will continue to grow, slowing the workflow, because the tools have to open, read, parse, insert into, and write the XML data. I would recommend including arcpy.SetLogMetadata(False) in your script to turn this feature off during your processing. You will also need to create a new fGDB and copy over the GDB objects to clean everything up; running Compact has no effect on this table.

The fGDB file that tends to grow is 'a00000004.gdbtable' inside the GDB file folder.
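
A minimal sketch of what that looks like near the top of a script (arcpy.SetLogHistory is a related switch for the geoprocessing history log; whether you also need it depends on your setup):

import arcpy

# Turn geoprocessing logging off before any tools run, so tool details are not
# appended to the fGDB's XML metadata table on every execution.
arcpy.SetLogMetadata(False)
arcpy.SetLogHistory(False)  # optional, related history-log switch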