How can I add values to this Python dictionary more efficiently?

ChristopherBevilacqua · ‎02-08-2019

I've been building on some principles from Richard Fairhurst's Blog /blogs/richard_fairhurst/2014/11/08/turbo-charging-data-manipulation-with-python-cursors-and-diction... My goal is to map the average daily well volume for the past 90 days. I have a table of daily volumes for about 1,300 wells. There are several years worth of records in the table. I also have a Python dictionary. The keys in the dictionary are the unique identifier for each well represented in the table. The values in the dictionary start as empty lists. The following block of code populates the dictionary values with the daily volume values from the past 92 days for each well. The code works, but it takes approximately 40 seconds for each well. How can I modify the code to improve performance? Please see the code and comments below. Any suggestions would be greatly appreciated.

i = 0 # This is a counter for testing purposes
table = acDaily # This is a SQL Server database table of daily water volumes.  There are ~1300 wells represented in the table, each with daily records covering several years.
fields = ['PROPNUM','D_DATE','WATER'] # PROPNUM is the unique identifier for each well in acDaily, D_DATE is the date of the daily volume, and WATER is the volume of produced water.

for prop in propNumDict: # Iterate through each item in propNumDict.  propNumDict is a Python dictionary with keys that are the unique values in the acDaily PROPNUM field.  The dictionary values start as empty lists.
    i += 1
    print str(i) + " - " + strftime("%Y-%m-%d %H:%M:%S")  # This is used to time the duration of each iteration, which is currently ~40 seconds.
    if i > 2:
        break
    else:
        with arcpy.da.SearchCursor(table, fields) as cursor: # Use a search cursor to go through the records in acDaily and if the PROPNUM matches the key in propNumDict and the records are from the past 92 days, add the water volume value to the dictionary.
            for row in cursor:
                if row[0] == prop and row[1] > tMinus92: # tMinus92 is a variable defined as datetime.datetime.today() - timedelta(days = 92).  It takes two days for the data to make it into this table.
                    propNumDict[prop].append(row[2])

print propNumDict
print strftime("%Y-%m-%d %H:%M:%S")‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

After the propNumDict dictionary is populated, I have another block of code that creates a new dictionary and populates it with the average values for each key in propNumDict. Here's what that looks like:

avgDailyDict = {}
for key, values in propNumDict.iteritems():
    if float(len(values)) > 0:
        avgDailyDict[key] = sum(values)/float(len(values))

print avgDailyDict‍‍‍‍‍‍

arc gis python‌

dictionary‌

arcpy.da.searchcursor‌

RichardFairhurst · ‎02-08-2019

Your loops are backwards. The SearchCursor is always on the outer loop and you must do all lookups only on the dictionary in the inner loop. This way you just go through the searchcursor rows once and don't get killed instatiating a new cursor for every value in your dictionary. Also dictionaries are specifically designed to optimize the performance of lookup operations. You should also apply a query to the SearchCursor to limit it to just records where the D_DATE > tMinus92. The D_DATE field should be an indexed field to improve the performance of creating a SearchCursor that applies a query on that field. I am not bothering to time one value, since now the entire table will be completed in seconds or minutes, depending on the size of the table.

table = acDaily # This is a SQL Server database table of daily water volumes.  There are ~1300 wells represented in the table, each with daily records covering several years.
fields = ['PROPNUM','D_DATE','WATER'] # PROPNUM is the unique identifier for each well in acDaily, D_DATE is the date of the daily volume, and WATER is the volume of produced water.

print strftime("%Y-%m-%d %H:%M:%S")  # This is used to time the duration of the loop portion of the script.

#Build an SQL expression to limit D_DATE to within the past 92 days
SQL = "D_DATE > '{}'".format(tMinus92.strftime('%Y-%m-%d %H:%M:%S'))

# Use a search cursor to go through the records that are from the past 92 days in acDaily and if the PROPNUM matches the key in propNumDict, add the water volume value to the dictionary.
with arcpy.da.SearchCursor(table, fields, SQL) as cursor:
for row in cursor:
keyValue = row[0]
if keyValue in propNumDict:
   # propNumDict was generated previously and start as an empty list prior to running the SearchCursor
   propNumDict[keyValue].append(row[2])

print propNumDict
print strftime("%Y-%m-%d %H:%M:%S")‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

View solution in original post

DanPatterson_Retired · ‎02-08-2019

you would have a small sample of what the data looks like would you?

ChristopherBevilacqua · ‎02-11-2019

JoshuaBixby · ‎02-08-2019

You are looping over the entire data set with the cursor for each prop. I am guessing it takes roughly 40 seconds to loop through the entire data and do no processing of the data. There is a way to restructure the code to make it much more efficient, I just don't have the time at the moment to dive into it. What you could try with little code refactoring is to add an SQL WHERE clause on the search cursor to limit the records being returned to only those of the prop you are processing. It is much faster for the back-end data store to filter records than pass all the records to the client and filter.

RichardFairhurst · ‎02-08-2019

Your loops are backwards. The SearchCursor is always on the outer loop and you must do all lookups only on the dictionary in the inner loop. This way you just go through the searchcursor rows once and don't get killed instatiating a new cursor for every value in your dictionary. Also dictionaries are specifically designed to optimize the performance of lookup operations. You should also apply a query to the SearchCursor to limit it to just records where the D_DATE > tMinus92. The D_DATE field should be an indexed field to improve the performance of creating a SearchCursor that applies a query on that field. I am not bothering to time one value, since now the entire table will be completed in seconds or minutes, depending on the size of the table.

table = acDaily # This is a SQL Server database table of daily water volumes.  There are ~1300 wells represented in the table, each with daily records covering several years.
fields = ['PROPNUM','D_DATE','WATER'] # PROPNUM is the unique identifier for each well in acDaily, D_DATE is the date of the daily volume, and WATER is the volume of produced water.

print strftime("%Y-%m-%d %H:%M:%S")  # This is used to time the duration of the loop portion of the script.

#Build an SQL expression to limit D_DATE to within the past 92 days
SQL = "D_DATE > '{}'".format(tMinus92.strftime('%Y-%m-%d %H:%M:%S'))

# Use a search cursor to go through the records that are from the past 92 days in acDaily and if the PROPNUM matches the key in propNumDict, add the water volume value to the dictionary.
with arcpy.da.SearchCursor(table, fields, SQL) as cursor:
for row in cursor:
keyValue = row[0]
if keyValue in propNumDict:
   # propNumDict was generated previously and start as an empty list prior to running the SearchCursor
   propNumDict[keyValue].append(row[2])

print propNumDict
print strftime("%Y-%m-%d %H:%M:%S")‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍

ChristopherBevilacqua · ‎02-08-2019

Thanks for the feedback, Richard. This looks so obvious now. I had previously tried incorporating the date query as part of the search cursor, but hadn't been able to resolve SQL syntax errors. I still have my loops reversed though.

I'll give your solution a try on Monday, hopefully, and will reply with results.

RichardFairhurst · ‎02-08-2019

I reread your code and noticed your data source is coming from SQL Server. So I modified my code to change the original SQL expression which was using a date format that is compatible with file geodatabases to make it compatible with SQL Server by changing

SQL = "D_DATE > DATE '{}'".format(tMinus92.strftime('%Y-%m-%d %H:%M:%S'))

to

SQL = "D_DATE > '{}'".format(tMinus92.strftime('%Y-%m-%d %H:%M:%S'))

ChristopherBevilacqua · ‎02-11-2019

Thanks again, Richard. That works great. The cursor runs through all the rows and populates the dictionary in about 4 seconds!