Any tips on making this process faster?

11-06-2012 07:58 AM
ZoeZaloudek
Occasional Contributor
Here's what I'm doing - I have a polygon feature class of census blocks with populations. I have split these census blocks based on watershed boundaries (side note: many of the blocks fall completely within a watershed and were not split). Now... each piece of each block that was split still has the total population for that block attributed to it. I'm trying to estimate the population of each piece. For each piece of each block, I calculate what percent of the block's total area that piece is, and then apply that percent-of-area to the block's total population.

I am running this for an entire state.  By my estimate, if the script takes 16 seconds to calculate each census block, it will take nearly 84 days to run...  That's no good.

I know there are a lot of cursors running, but I can't think of a better way to do this off the top of my head.  The cursor(s) should be looking at a table view, which normally has only 1 - 3 records.

'blockids' is a list with each unique BLOCKID. "finalblocks" is the variable standing for my census blocks feature class.

blockcounter = 1
for blockid in blockids:
    # Make Table View of all polygons with this blockid
    block_view = 'block_view' + blockid
    wclause = '"BLOCKID10" = \'' + blockid + "'"
    arcpy.MakeTableView_management(finalblocks, block_view, wclause)
    bcount = arcpy.GetCount_management(block_view)
    print str(bcount) + ' block pieces for ' + blockid + '... ' + str(blockcounter) + ' of ' + str(len(blockids))
    rows = arcpy.UpdateCursor(block_view)
    for row in rows:
        blockpop = str(row.POP10)
    if blockpop <> '0':
        if str(bcount) <> '1':
            # If this block ID has multiple pieces and total population is not 0
            totalarea = 0.0
            totalpop = 0.0
            for row in rows:
                totalarea += row.Shape_Area
            for row in rows:
                thisarea = row.Shape_Area
                pctarea = thisarea / totalarea
                totalpop = row.POP10
                thispop = totalpop * pctarea
                row.POP10_est = thispop
                rows.updateRow(row)
        else:
            # If there is only one polygon for this block ID
            for row in rows:
                row.POP10_est = row.POP10
                rows.updateRow(row)
    else:
        # If the total population of this block is 0 anyway
        for row in rows:
            row.POP10_est = 0
            rows.updateRow(row)
    del rows, row
    blockcounter += 1



Thank you in advance for any help!
9 Replies
JeffBarrette
Esri Regular Contributor
If you are on 10.1, have you considered using the new Data Access module (arcpy.UpdateCursor vs arcpy.da.UpdateCursor)?

http://resources.arcgis.com/en/help/main/10.1/#/What_is_the_data_access_module/018w00000008000000/

The cursors are considerably faster.
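As a rough, untested sketch of what that could look like for this task on 10.1 (field names follow the thread; `prorate` and `estimate_populations` are made-up names, not the poster's code - the import is deferred so the pure helper works without ArcGIS installed):

```python
def prorate(total_pop, piece_area, total_area):
    # Area-weighted share of the block's population; guard against empty blocks.
    if total_area == 0:
        return 0.0
    return total_pop * (piece_area / total_area)

def estimate_populations(fc):
    # Two passes over the whole feature class instead of one table view per block.
    import arcpy  # arcpy.da requires ArcGIS 10.1+

    # Pass 1: total area per block ID.
    totalarea = {}
    with arcpy.da.SearchCursor(fc, ['BLOCKID10', 'SHAPE@AREA']) as cursor:
        for blockid, area in cursor:
            totalarea[blockid] = totalarea.get(blockid, 0.0) + area

    # Pass 2: write the area-weighted estimate for every piece.
    fields = ['BLOCKID10', 'POP10', 'SHAPE@AREA', 'POP10_est']
    with arcpy.da.UpdateCursor(fc, fields) as cursor:
        for row in cursor:
            row[3] = prorate(row[1], row[2], totalarea[row[0]])
            cursor.updateRow(row)
```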

Jeff
BradPosthumus
Occasional Contributor II
When working with large datasets I typically follow this workflow:

1. Use a search cursor to load the entire dataset into a Python dictionary (i.e. into memory).
2. Process all of the records using the data in the dictionary.
3. Use an update cursor to update the entire dataset with the results.
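The three steps can be sketched with plain Python standing in for the cursors (the IDs and numbers below are made up; in the real script step 1 would be a search cursor over finalblocks and step 3 an update cursor):

```python
# Step 1: one pass loads everything into memory.
# Each tuple stands in for a cursor row: (BLOCKID10, POP10, Shape_Area).
pieces = [
    ('150010201001000', 40, 60.0),
    ('150010201001000', 40, 40.0),
    ('150010201001001', 10, 75.0),
]
totalarea = {}
for blockid, pop, area in pieces:
    totalarea[blockid] = totalarea.get(blockid, 0.0) + area

# Step 2: all the arithmetic happens against the dictionary, not the table.
# The result is parallel to pieces.
estimates = [pop * area / totalarea[blockid] for blockid, pop, area in pieces]
# -> [24.0, 16.0, 10.0]

# Step 3: a single update cursor would then write these back to POP10_est.
```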

I'm not sure what the bottleneck in your code is, but the above will save the time consumed by the MakeTableView and GetCount processes and the creation of an update cursor for every block id.

Let me know if some pseudocode would help explain this further.
ZoeZaloudek
Occasional Contributor
If you are on 10.1, have you considered using the new Data Access module (arcpy.UpdateCursor vs arcpy.da.UpdateCursor)?

http://resources.arcgis.com/en/help/main/10.1/#/What_is_the_data_access_module/018w00000008000000/

The cursors are considerably faster.

Jeff


Actually, I am using 10.0.  I'll talk to our GIS Manager about getting 10.1...  If/when I am able to upgrade, I'll look into using the Data Access module.

Thanks Jeff!
ZoeZaloudek
Occasional Contributor
When working with large datasets I typically follow this workflow:

1. Use a search cursor to load the entire dataset into a Python dictionary (i.e. into memory).
2. Process all of the records using the data in the dictionary.
3. Use an update cursor to update the entire dataset with the results.

I'm not sure what's the bottleneck in your code but the above will save the time consumed by MakeTableView and GetCount processes and the creation of an update cursor for every block id.

Let me know if some pseudocode would help explain ths further.


I think I understand what you're saying... basically, get the data from the feature class's table into a list or array (or is 'dictionary' a term that means something else?), and then do the heavy-duty processing from there...?  If you have the time, some pseudocode would probably help...  It looks like I am going to have to set this project aside for the time being, to work on some higher-priority items.  I'll post again when I am able to try out your idea.

Thanks Brad!
RhettZufelt
MVP Frequent Contributor
blockcounter = 1
for blockid in blockids:
    # Make Table View of all polygons with this blockid
    block_view = 'block_view' + blockid
    wclause = '"BLOCKID10" = \'' + blockid + "'"
    arcpy.MakeTableView_management(finalblocks, block_view, wclause)
    bcount = arcpy.GetCount_management(block_view)
    print str(bcount) + ' block pieces for ' + blockid + '... ' + str(blockcounter) + ' of ' + str(len(blockids))
    rows = arcpy.UpdateCursor(block_view)
    totalarea = 0.0
    for row in rows:
        blockpop = str(row.POP10)
        totalarea += row.Shape_Area
    if blockpop <> '0':
        if str(bcount) <> '1':
            # If this block ID has multiple pieces and total population is not 0
            totalpop = 0.0
            for row in rows:
                thisarea = row.Shape_Area
                pctarea = thisarea / totalarea
                totalpop = row.POP10
                thispop = totalpop * pctarea
                row.POP10_est = thispop
                rows.updateRow(row)
        else:
            # If there is only one polygon for this block ID
            for row in rows:
                row.POP10_est = row.POP10
                rows.updateRow(row)
    else:
        # If the total population of this block is 0 anyway
        for row in rows:
            row.POP10_est = 0
            rows.updateRow(row)
    del rows, row
    blockcounter += 1





Dictionary is like a key:value lookup table: http://docs.python.org/2/tutorial/datastructures.html#dictionaries
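A throwaway illustration (the block IDs are made up):

```python
# Keys map to values; a lookup is a single hash probe, not a table scan.
totalpopdict = {'150010201001000': 40, '150010201001001': 10}
pop = totalpopdict['150010201001001']       # -> 10
has_it = '150010201001002' in totalpopdict  # -> False
```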

I would have to agree with using a list/dict so you are loading it into memory, especially if you have a lot of blockids, since making table views is time-intensive.

Not sure how it would affect the rest of the code, but the more for loops, the longer it takes to execute. See above: you should be able to eliminate the separate totalarea for loop by folding that sum into the first loop. Since you are already looping through the entire set, you may as well extract those values while you're there.
Also, comment out or remove the print statement. Print statements inside loops make the script run WAAAY slower than without them.

R_
ZoeZaloudek
Occasional Contributor
Using python dictionaries/lists helped a lot!  Thanks to everyone for your answers.  This part of the script still had to be run overnight, but that is way faster than what I was doing before.

# Make some dictionaries and lists for population and area
totalpopdict = {}
areadict = {}
totalareadict = {}
allblockids = []
allhuc8s = []
rows = arcpy.SearchCursor(finalblocks)
for row in rows:
    blockid = str(row.BLOCKID10)
    thisblockpop = str(row.POP10)
    totalpopdict[blockid] = thisblockpop
    thisblockarea = row.Shape_Area
    thisblockhuc8 = str(row.HUC_8)
    thisblockidhuc8 = blockid + '_' + thisblockhuc8
    areadict[thisblockidhuc8] = thisblockarea
    if blockid in totalareadict:
        oldtotalarea = totalareadict[blockid]
        newtotalarea = thisblockarea + oldtotalarea
        totalareadict[blockid] = newtotalarea
    else:
        totalareadict[blockid] = thisblockarea
    allblockids.append(blockid)
    allhuc8s.append(thisblockhuc8)
del rows, row
gc.collect()

# Make lists of the unique blockids and huc8s
blockids = []
[blockids.append(value) for value in allblockids if not blockids.count(value)]
huc8s = []
[huc8s.append(value) for value in allhuc8s if not huc8s.count(value)]

# Calculate a dictionary with number of block pieces
gc.collect()
blkpiecesdict = {}
thisblockhuc8s = areadict.keys()
for blockid in blockids:
    count = 0
    for thisblockhuc8 in thisblockhuc8s:
        if thisblockhuc8[:15] == blockid:
            count += 1
    blkpiecesdict[blockid] = count
print 'gathered needed info into lists and dictionaries'

# Calculate a list with population estimates
gc.collect()
popestlist = []
for blockid in blockids:
    for huc8 in huc8s:
        thisblockidhuc8 = blockid + '_' + huc8
        if thisblockidhuc8 in areadict:
            totalpop = float(totalpopdict[blockid])
            totalarea = float(totalareadict[blockid])
            numpieces = blkpiecesdict[blockid]
            if numpieces > 1:
                thisarea = areadict[thisblockidhuc8]
                pctarea = thisarea / totalarea
                pctpop = pctarea
                thispop = totalpop * pctpop
            elif numpieces == 1:
                thispop = totalpop
                thisarea = totalarea
                pctpop = 1.0
                pctarea = 1.0
            else:
                thispop = 0
                thisarea = 0.0
                pctpop = 0.0
                pctarea = 0.0
            line = [blockid, huc8, totalpop, thispop, pctpop, totalarea, thisarea, pctarea, numpieces, thisblockidhuc8]
            popestlist.append(line)

# Make a CSV out of the data we have (to be joined to the finalblocks feature class)
gc.collect()
f = open(csv, 'w')
headerline = 'UniqueID,BLOCKID,HUC8,TotalPop,ThisPop,PctPop,TotalArea,ThisArea,PctArea,NumPieces,BlkID_HUC8' + '\n'
f.write(headerline)
for line in popestlist:
    logline = str(popestlist.index(line)) + ',' +  str(line[0]) + ',' + str(line[1]) + ',' + str(line[2]) + ',' + str(line[3]) + ',' + str(line[4]) + ',' + str(line[5]) + ',' + str(line[6]) + ',' + str(line[7]) + ',' + str(line[8]) + ',' + str(line[9])
    logline = logline + '\n'
    f.write(logline)
f.close()

# Convert the CSV into GDB table
desc = arcpy.Describe(csv)
fields = desc.fields
fieldinfo = arcpy.FieldInfo()
for field in fields:
    print field.name
    fieldinfo.addField(field.name, field.name, 'VISIBLE', '')
csv_view = 'csv_' + dtime
arcpy.MakeTableView_management (csv, csv_view, '', '', fieldinfo)
arcpy.TableToTable_conversion (csv, FGDB, csv_view, '', '', '')
newtable = FGDB + '\\' + csv_view
newtable_view = 'newtable_' + dtime
arcpy.MakeTableView_management (newtable, newtable_view)

# Make Table View of the finalblocks feature class
finalblocks_view = 'finalblocks_' + dtime
arcpy.MakeTableView_management (finalblocks, finalblocks_view)

# Join the CSV to the finalblocks feature class based on blockid
arcpy.JoinField_management (finalblocks_view, 'BlkID_HUC8', newtable_view, 'BlkID_HUC8')
print 'Join performed'

# Calculate the POP10_est field
arcpy.CalculateField_management (finalblocks_view, 'POP10_est', '[ThisPop]', 'VB')
print 'POP10_est field calculated'
print datetime.datetime.today()
RhettZufelt
MVP Frequent Contributor
Glad to hear it is working better.

Noticed there is still one print statement inside a for loop:

for field in fields:
    print field.name
    fieldinfo.addField(field.name, field.name, 'VISIBLE', '')


Removing that will speed it up, but it will probably still take overnight.

R_
MathewCoyle
Frequent Contributor
Using python dictionaries/lists helped a lot!  Thanks to everyone for your answers.  This part of the script still had to be run overnight, but that is way faster than what I was doing before.


Probably won't make it much faster, but this line is just ugly.
logline = str(popestlist.index(line)) + ',' +  str(line[0]) + ',' + str(line[1]) + ',' + str(line[2]) + ',' + str(line[3]) + ',' + str(line[4]) + ',' + str(line[5]) + ',' + str(line[6]) + ',' + str(line[7]) + ',' + str(line[8]) + ',' + str(line[9])


Assuming you are printing every item in the line list variable, I'd use this.

logline = "%s," % popestlist.index(line) + ",".join(str(item) for item in line)


Your big time sink is most likely your join field process. I'd try to convert that to search cursor/dictionary/update cursor like others have suggested.
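A bare-bones sketch of that idea, with a plain list standing in for the update cursor over finalblocks (the keys and values below are made up):

```python
# Results keyed on BlkID_HUC8, as the CSV rows were.
popest = {'150010201001000_07080205': 24.0,
          '150010201001000_07080206': 16.0}

# Stand-in for update-cursor rows over finalblocks: [BlkID_HUC8, POP10_est].
rows = [['150010201001000_07080205', None],
        ['150010201001000_07080206', None]]

for row in rows:                      # with arcpy this loop would be the cursor
    row[1] = popest.get(row[0], 0.0)  # dictionary lookup replaces JoinField
```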
RhettZufelt
MVP Frequent Contributor
Also, depending on the size of your datasets, if you put an attribute index on the fields being used to join each table, join speed can increase substantially.

R_