Updating a feature layer with a global SQL update

05-21-2014 11:31 PM
MarkWingfield
New Contributor III
Currently, I have a feature layer that I step through row by row, updating each row with the output of a SQL function. This works but is slow. I have been asked to try to speed this up, and the suggestion was to attempt some form of global update that could link the feature class to a function or view and update all rows in one go. The data returned by the current function relies on inputs from the current row. I have attached the current code below:

    message = "[INFO]  Looping through selected RoadNetwork started at " + str(datetime.datetime.now())
    print message
    arcpy.AddMessage(message)
    LogFile.write(message + "\n")
    # Call function to get extra columns
    cursor = arcpy.UpdateCursor(FinalNetworkLayerName)
    for featurerow in cursor:
        currentRow = currentRow + 1
        if currentRow%100 == 0:
            message = "[INFO]  Processed " + str(currentRow) + " of " + str(RoadNetworkCount) + " rows at " + str(datetime.datetime.now())        
            print message
            arcpy.AddMessage(message)
            LogFile.write(message + "\n")
        CurrentToid = featurerow.TOID
        # Establishing connection to MSSQL database

        sql = "SELECT * FROM [DWOP].[dbo].[ufunc_ReturnGGFData] ('" + CurrentToid + "'," + OfficeIDParm
        # Call function
        sde_return = sdeConn2.execute(sql)
        if isinstance(sde_return, list):
            for row in sde_return:
                featurerow.SEG_ID = row[0]
                featurerow.STL_NAME = row[1]
                featurerow.STR_NAME = row[2]
                featurerow.ST_CLASS = row[3]
                featurerow.ORIG_LVL = row[4]
                featurerow.DEST_LVL = row[5]
                featurerow.ONEWAY = row[6]
                featurerow.SPEED1 = row[7]
                featurerow.SPEED2 = row[8]
                featurerow.SPEED3 = row[9]
                featurerow.SPEED4 = row[10]
                featurerow.TRN_MODE = row[11]
                featurerow.BTH_MODE = row[12]
                featurerow.RESTR_ORI = row[13]
                featurerow.RESTR_DEST = row[14]
                cursor.updateRow(featurerow)

    del cursor, featurerow

As I stated above, this works but is quite slow (averaging over 2 seconds per row update).

It seems unlikely to me that such a global update could be done but I am new to ArcPy so may be missing something.

Cheers

Mark
RichardFairhurst
MVP Honored Contributor
First of all, use the Code tags (the # button) around Python code to preserve the spacing in a forum post.

If you are using ArcGIS 10.1 or above, convert your UpdateCursor to a Data Access (arcpy.da) UpdateCursor.  They are at least 10 times faster than the old-style cursors.  Also, supply a field list that limits the cursor to only the fields you need.
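For illustration, a minimal sketch of the conversion (the three-field list is hypothetical; list whichever fields your loop actually reads or writes):

# Old style, reads every field:
#     cursor = arcpy.UpdateCursor(FinalNetworkLayerName)
# Data Access style (ArcGIS 10.1+), reads only the listed fields:
cursor = arcpy.da.UpdateCursor(FinalNetworkLayerName, ["TOID", "SEG_ID", "STL_NAME"])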

About the only other way to speed it up is to loop through all selected features with a SearchCursor first and build a list of all ObjectIDs, then feed the full list into a single query against the MS SQL database using an IN operation.  Next, read the entire MS SQL result, with all its related fields, into a dictionary in one pass, using the same ObjectID value as the key.  Then run the update code, but use the dictionary key to find the related data instead of performing a query.  Dictionary lookups are up to 100 times faster than SQL queries.  For example, if there are 100 selected features, this approach reduces 101 queries to 3 queries and 2 list/dictionary reads.  That should give a large performance boost.
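A minimal sketch of that flow, keeping the TOID key from Mark's original code and assuming the function's output can be exposed as a hypothetical [DWOP].[dbo].[GGFData] view (only two of the fifteen target fields are shown; sdeConn2 is the connection from the original script):

import arcpy

# Pass 1: collect the key values from the selected features.
toids = [row[0] for row in arcpy.da.SearchCursor(FinalNetworkLayerName, ["TOID"])]

# Pass 2: one IN query for all keys instead of one query per row.
sql = ("SELECT TOID, SEG_ID, STL_NAME FROM [DWOP].[dbo].[GGFData] "
       "WHERE TOID IN (" + ",".join("'" + t + "'" for t in toids) + ")")
sde_return = sdeConn2.execute(sql)

# Pass 3: build the lookup dictionary, then update without any per-row SQL.
ggf = {r[0]: r[1:] for r in sde_return} if isinstance(sde_return, list) else {}
with arcpy.da.UpdateCursor(FinalNetworkLayerName, ["TOID", "SEG_ID", "STL_NAME"]) as cursor:
    for row in cursor:
        if row[0] in ggf:
            cursor.updateRow([row[0]] + list(ggf[row[0]]))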

Queries are expensive because they have to use physical hardware and network throughput to access and transfer the data, while dictionaries are built in local memory.  So 100 random access hits by query are much slower than 100 random access hits to a dictionary in memory.  A single query of a 100 ObjectID list will be faster, because the cursor read is not random, but sequential, and field indexing has a better effect through 1 query than through 100 queries, since the index is only traversed once for 1 query, but has to be traversed 100 times for 100 queries.  (I assume you have an index on all of the related fields your queries hit, since unindexed fields should never be used to create a relationship between 2 database tables.)

If the ObjectID list is crazy huge you may need to chunk it into selection groups with the modulus operator so that the SQL where clause is not excessively large.  Then query that group back to a separate layer copy of the input selection layer.  The layer copy would be created outside of your loop using the Make Feature Layer operation.
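A sketch of that chunking, using list slices in place of the modulus trick (the 1000-key chunk size is an arbitrary assumption, and the same hypothetical [DWOP].[dbo].[GGFData] view from the sketch above is used):

def chunks(seq, size=1000):
    # Yield successive fixed-size slices of the key list.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

ggf = {}
for group in chunks(toids):
    sql = ("SELECT TOID, SEG_ID, STL_NAME FROM [DWOP].[dbo].[GGFData] "
           "WHERE TOID IN (" + ",".join("'" + t + "'" for t in group) + ")")
    sde_return = sdeConn2.execute(sql)
    if isinstance(sde_return, list):
        for r in sde_return:
            ggf[r[0]] = r[1:]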

JamesCrandall
MVP Frequent Contributor
This could certainly throw a wrench into the "IN" statement approach --- are there any issues with the chunk-processing part?  One potential alternative is to use a table variable instead.  If the list of OIDs is really large, you could first run an insert into the table variable and then simply JOIN it to the table in your main query.  This approach would best fit into a stored procedure rather than dynamic SQL (not a bad plan either, as the SQL statement will have already been compiled on the server).
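A hedged sketch of that idea, written here as dynamic SQL built in Python for brevity (James's point is that the T-SQL would live better inside a stored procedure).  The [DWOP].[dbo].[GGFData] table, the oid_list variable, and the assumption that the connection accepts a multi-statement batch are all illustrative; note also that SQL Server caps a single VALUES row constructor at 1000 rows, so a very large OID list would still need batching:

# Bulk-load the OIDs into a table variable server-side, then JOIN once.
oid_values = ",".join("(" + str(oid) + ")" for oid in oid_list)
sql = ("DECLARE @Oids TABLE (Oid int PRIMARY KEY); "
       "INSERT INTO @Oids (Oid) VALUES " + oid_values + "; "
       "SELECT g.OBJECTID, g.SEG_ID, g.STL_NAME "
       "FROM [DWOP].[dbo].[GGFData] g "
       "INNER JOIN @Oids o ON g.OBJECTID = o.Oid;")
sde_return = sdeConn2.execute(sql)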
RichardFairhurst
MVP Honored Contributor
I agree that really large ObjectID lists will be a significant factor in deciding the best optimization approach.  Such lists are even more reason to scrap the approach that creates a new query for every individual ObjectID.  Under any circumstances, start by using the Data Access version of the UpdateCursor, since that will give the code a huge boost and should immediately improve the 2-seconds-per-record performance, which is largely due to using an old-style cursor.
ChrisSnyder
Regular Contributor III
Some comments and examples:

1. Use a dictionary; as Richard said, it's way faster than using embedded cursors, and your bosses will of course give you a raise, since you improved the performance so much.
2. If the dictionary gets "too big"... that is, it occupies more than ~2.1 GB of RAM, which is the (sort of) limit of 32-bit Python, you can use 64-bit Python along with the 64-bit "ArcGIS_BackgroundGP_for_Desktop_101sp1.exe", which basically turns arcpy (v10.1+) into a 64-bit bad a$$ capable of using gobs of RAM.

Here's an (untested) example of that dictionary calculaty sort of thing... It sucks an existing table into a dictionary, and then updates the table by calcing the ID field value to the sum of all the (original) ID values that are >= the ID value being processed. Why would you want to do that? I don't know, but it illustrates the general concept.

myTable = r"C:\temp\test.gdb\test"
valueDict = {r[0]:(r[1]) for r in arcpy.da.SearchCursor(myTable, ["OID@","ID"])} 
updateRows = arcpy.da.UpdateCursor(myTable, ["OID@","ID"])
for updateRow in updateRows:
   oidVal, idVal = updateRow
   updateRow[1] = sum([i[0] for i in valueDict if valueDict[0] >= idVal])
   updateRows.updateRow(updateRow)
del updateRow, updateRows 
RichardFairhurst
MVP Honored Contributor
Chris:

Good point.  The entire related table can just be read into a dictionary so that it has the complete set of ID keys without using any where clause.  If that is done before processing the update cursor, the update loop can just match against the dictionary keys.  That way it doesn't matter how big the ObjectID list is, and no where clause needs to be generated at all.  It also cuts the work down to just two cursor loops that are completely independent of each other, as in the sketch below.
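A minimal sketch of that two-pass pattern, again assuming the function's output can be exposed as a hypothetical [DWOP].[dbo].[GGFData] table keyed by TOID and that sdeConn2 is the connection from the original script (only two of the fifteen target fields are shown):

import arcpy

lookup_fields = ["SEG_ID", "STL_NAME"]  # extend to all fifteen target fields

# Loop 1: one sequential read of the entire related table, no where clause.
sql = "SELECT TOID, " + ", ".join(lookup_fields) + " FROM [DWOP].[dbo].[GGFData]"
related = {r[0]: r[1:] for r in sdeConn2.execute(sql)}

# Loop 2: one sequential pass over the feature class, matching on the key alone.
with arcpy.da.UpdateCursor(FinalNetworkLayerName, ["TOID"] + lookup_fields) as cursor:
    for row in cursor:
        if row[0] in related:
            cursor.updateRow([row[0]] + list(related[row[0]]))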
ChrisSnyder
Regular Contributor III
After reading your original post, all you are doing is basically a join/calc... the performance of which can still be improved using a dictionary/update cursor. This is some older code (it doesn't use .da cursors, dictionary comprehensions, etc.), but it illustrates the traditional join-and-calc sort of thing you are trying to accomplish.

http://forums.arcgis.com/threads/9555-Modifying-Permanent-Sort-script-by-Chris-Snyder?p=30010&viewfu...
RichardFairhurst
MVP Honored Contributor
The only time I have found that this style is necessary is in a label expression that pulls data from a related table with a One to Many or Many to Many relationship.  Because the label expression is repeatedly read for every label generated, almost like the expression is within a cursor loop, there is no benefit to reading the entire related table into a dictionary within the label expression.  The dictionary would just be regenerated for every label and no cursor reads would be eliminated.
MarkWingfield
New Contributor III
I think I get the gist of what you are all saying, but I am brand new to Python and am still struggling.

Given the code I had before, and the fact that a slightly altered version of the SQL function (hopefully to fit in with a data dictionary solution) returns one row of data in the following format, with the OBJECTID of the current row as an input:

SEG_ID int,
STL_NAME varchar(100),
STR_NAME varchar(100),
ST_CLASS char(1),
ORIG_LVL int,
DEST_LVL int,
ONEWAY int,
SPEED1 float,
SPEED2 float,
SPEED3 float,
SPEED4 float,
TRN_MODE char(8),
BTH_MODE char(8),
RESTR_ORI varchar(2000),
RESTR_DEST varchar(2000),
TOID varchar(16),
OBJECTID int

My current code using da.UpdateCursor instead of the standard UpdateCursor is:

    fields = ("SEG_ID", "STL_NAME", "STR_NAME", "ST_CLASS", "ORIG_LVL", "DEST_LVL", "ONEWAY", "SPEED1", "SPEED2", "SPEED3", "SPEED4", "TRN_MODE", "BTH_MODE", "RESTR_ORI", "RESTR_DEST", "TOID", "OBJECTID")
    cursor = arcpy.da.UpdateCursor(FinalNetworkLayerName, fields)
    for featurerow in cursor:
        CurrentObjectID = featurerow[16]  # OBJECTID is the 17th field, so index 16
        # Establishing connection to MSSQL database

        sql = "SELECT * FROM [DWOP].[dbo].[ufunc_ReturnGGFDataObjectID] (" + CurrentObjectId + "," + OfficeIDParm 
        # Call function
        sde_return = sdeConn2.execute(sql)
        if isinstance(sde_return, list):
            for row in sde_return:
                featurerow[0] = row[0]
                featurerow[1] = row[1]
                featurerow[2] = row[2]
                featurerow[3] = row[3]
                featurerow[4] = row[4]
                featurerow[5] = row[5]
                featurerow[6] = row[6]
                featurerow[7] = row[7]
                featurerow[8] = row[8]
                featurerow[9] = row[9]
                featurerow[10] = row[10]
                featurerow[11] = row[11]
                featurerow[12] = row[12]
                featurerow[13] = row[13]
                featurerow[14] = row[14]
                cursor.updateRow(featurerow)

    del cursor, featurerow


I am looking at a data dictionary definition of:

rdlkDict = dict([(r.OBJECTID, (r.SEG_ID, r.STL_NAME, r.STR_NAME, r.ST_CLASS, r.ORIG_LVL, r.DEST_LVL, r.ONEWAY, r.SPEED1, r.SPEED2, r.SPEED3, r.SPEED4, r.TRN_MODE, r.BTH_MODE, r.RESTR_ORI, r.RESTR_DEST, r.TOID)) for r in arcpy.SearchCursor(FinalNetworkLayerName)])


How would I populate this from the SQL call, and then how do I link this to the FinalNetworkLayerName feature class? I cannot get my head around how the loops work and relate to each other to update 15 columns from the SQL call via the data dictionary.

Sorry, I am completely new to Python.

Cheers
ChrisSnyder
Regular Contributor III
So the first thing is to get your MSSQL table into a dictionary... I don't use MSSQL, but I imagine you have some sort of database connection file, right? Also, in my example below, I am assuming the MSSQL table has a field called OBJECTID that you are using as the join item, but that's probably not the case. OBJECTID is generally a poor choice for a join field since it is not stable. Anyway...

mySqlTable = r"\\mynetwork\myconnectionfile.sde"  # .sde connection file; append the table name to point at the table
mySqlFieldsToReturnList = ["OBJECTID", "SEG_ID", "STL_NAME", "STR_NAME", ...]
rdlkDict = {r[0]: r[1:] for r in arcpy.da.SearchCursor(mySqlTable, mySqlFieldsToReturnList)}


So now you should have this awesome dictionary where the key is the join item, and all the field values that you will need in your update cursor are readily available at light speed. The cool thing now is that the dictionary can look up the values from the SQL table a kazzilion times faster than the "sql = "SELECT * FROM [DWOP].[dbo].[ufunc_ReturnGGFDataObjectID]" thingy was. From there, the update cursor just pulls its values out of the dictionary, as in the sketch below.
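To tie this back to Mark's question, the update-cursor side might look like this minimal sketch, assuming rdlkDict was built as above with OBJECTID as the key followed by the fifteen target fields in the order Mark listed (the field order behind the "..." above is an assumption):

fields = ["OID@", "SEG_ID", "STL_NAME", "STR_NAME", "ST_CLASS", "ORIG_LVL",
          "DEST_LVL", "ONEWAY", "SPEED1", "SPEED2", "SPEED3", "SPEED4",
          "TRN_MODE", "BTH_MODE", "RESTR_ORI", "RESTR_DEST"]
with arcpy.da.UpdateCursor(FinalNetworkLayerName, fields) as cursor:
    for row in cursor:
        oid = row[0]
        if oid in rdlkDict:
            # Overwrite every field after OID@ with the matching SQL values.
            cursor.updateRow([oid] + list(rdlkDict[oid]))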