How to use Python dictionaries and update cursor instead of ArcGIS Joins?

PeterWilson · ‎10-05-2012

I'm looking for a way to use python dictionaries instead of ArcGIS Joins. Please note that I'm new to Python and would need some assistance to understand your code if you don't mind and have the time.

I have 7.5 million parcles saved as a feature class. Within the feature class I have a field called "SG_Code". I also have two tables called WARMS (i.e. WARMS_DW760 & WARMS DW764). They each have a field called "SG_Code" & "TIT_DEED_NUM". I then have another two additional tables called RED (i.e. Redistribution) and REST (i.e. Restitution). The RED and REST tables have a two fields "SG_CODE" and "TIT_DEED_NUM".

I need to create a subset feature class of the 7.5 million parcles where I find a match using firstly the "SG_Code" between the parcles feature class and each WARMS table separately (i.e. WARMS_DW760 then WARMS_DW764). I then need to find a match using the original 7.5 million feature class and RED and REST tables using the "SG_Code". Then I need to find a match based on the match already found using the 7.5 million records between the WARMS_DW760 and WARMS_DW764 and then match the "TIT_DEED_NUM" and the "TIT_DEED_NUM" found in the RED and REST tables to see if I find additional matches using the "TIT_DEED_NUM" as not all the records have "SG_Codes" within the REST and RED tables.

In short, what I'm trying to accomplish is to identify where I find a match between the parcles and warms, then a match between the parcles and RED and REST.

I've used Add Joins so far to accomplish this, but its running forever. I've attached my model that I've built so far to better understand what I'm trying to accomplish.

Regards

DuncanHornby · ‎10-07-2012

Peter,

A dataset with 7.5 million Features is pretty large! What ever you do it will take time to process. Loading dictionaries with 7.5 million entries will probably cripple a machine anyway...

One quick win is to ensure your fields have attribute indices, have you created these? You can often get a significant performance boost with these alone.

Duncan

PeterWilson · ‎10-08-2012

Peter,

A dataset with 7.5 million Features is pretty large! What ever you do it will take time to process. Loading dictionaries with 7.5 million entries will probably cripple a machine anyway...

One quick win is to ensure your fields have attribute indices, have you created these? You can often get a significant performance boost with these alone.

Duncan

Hi Duncan

I have created attribute indices, unfortunately ArcGIS joins are very slow. I found the following post that suggested that python dictionaries outperform ArcGIS joins.

http://forums.arcgis.com/threads/55099-Update-cursor-with-joined-tables-work-around-w-dictionaries?p...

Regards

PhilMorefield · ‎10-10-2012

Hi Duncan

I have created attribute indices, unfortunately ArcGIS joins are very slow. I found the following post that suggested that python dictionaries outperform ArcGIS joins.

http://forums.arcgis.com/threads/55099-Update-cursor-with-joined-tables-work-around-w-dictionaries?p...

Regards

Check this post out for starters: http://forums.arcgis.com/threads/52511-Cool-cursor-dictionary-constructor-one-liner?p=179517#poststo...

If I recall, the method in that post only works for shapefiles with one attribute.

I agree with Duncan that 7.5 million features is unlikely to fit into memory all at once. I modified the one-liner from the above thread so that I could produce a fully-factorial dictionary that had all records and all attributes, and the performance was terrible. And that was for a shapefile with ~3000 polygons.

ChrisSnyder · ‎10-10-2012

Here is an actual example, and also some notes:

1. This method is WAY faster if you can use the data access cursors (new in v10.1)... Like about 20x faster!
2. As Duncan said, you may indeed run out of 32-bit memory with 7.5 million rows. I've experienced this... Remember that ints take up less RAM than floats way way less than strings.... especially long strings.
3. Unlike the example below, it would be way more compact to make just one dictionary and for each look up table, append to the attribute tuple/list that each key is pointing to... No need to have duplicate keys if only one key is required.

#Process: Build dictionaries of the basin size stats that we can then use in an update cursor - faster than join/calc
message = "Populating BASIN_SIZE_DICT..."; showPyMessage()
basinSizeDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(flowAccZoneStatTbl)]) #Int
message = "Populating SLOPE_DICT..."; showPyMessage()
slopeDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(slopePctZoneStatTbl)]) #Int
message = "Populating CURVE_PLAN3_DICT..."; showPyMessage()
curvePlan3Dict = dict([(r.VALUE, (r.MEAN)) for r in arcpy.SearchCursor(curvePlan3ZoneStatTbl)]) #Float
message = "Populating ELEV_DICT..."; showPyMessage()
elevationDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(elevationZoneStatTbl)]) #Int

#Process: Blah blah
##...State Secrets...

#Process: Update the streamFC table
message = "Updating stream table..."; showPyMessage()
updateRows = arcpy.UpdateCursor(streamFC)
for updateRow in updateRows:
    streamIdValue = updateRow.STREAM_ID
    if streamIdValue in outFlowStreamLinkList:
        updateRow.OUT_FLG = 1
    else:
        updateRow.OUT_FLG = -1
    updateRow.TILE_NO = int(areaId)
    updateRow.SUB_NET = streamNetworkIdDict[streamIdValue]
    updateRow.ACRE_MIN = basinSizeDict[streamIdValue][0] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    updateRow.ACRE_MAX = basinSizeDict[streamIdValue][1] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    updateRow.ACRE_MEAN = basinSizeDict[streamIdValue][2] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    del basinSizeDict[streamIdValue]
    updateRow.ELEV_MIN = elevationDict[streamIdValue][0]
    updateRow.ELEV_MAX = elevationDict[streamIdValue][1]
    updateRow.ELEV_MEAN = elevationDict[streamIdValue][2]
    del elevationDict[streamIdValue]
    updateRow.SLOPE_MIN = slopeDict[streamIdValue][0]
    updateRow.SLOPE_MAX = slopeDict[streamIdValue][1]
    updateRow.SLOPE_MEAN = slopeDict[streamIdValue][2]
    del slopeDict[streamIdValue]
    updateRow.PCRV3_MEAN = curvePlan3Dict[streamIdValue]
    del curvePlan3Dict[streamIdValue]    
    updateRows.updateRow(updateRow)
del updateRow, updateRows
message = "Done updating stream table!"; showPyMessage()

KimOllivier · ‎11-08-2012

I have found the interactive Merge tool with a fieldmap is a good substitute for a join. It works very efficiently in ArcMap but not in a script.

I just reopen the results in the saved MXD and rerun the tool again. It is so much faster that its worth the inconvenience once a month on a pair of files containing 2M records.

Maybe one day it will work in a script?

I did solve another problem with multiple joins never finishing by partitioning the task into 12 parts because there happened to be a key I could use for subsets. Each part ran in less than 10 minutes for under an hour of processing.

If you have a way of selecting subsets that may avoid exceeding your memory limits and still use the python dictionary trick. I limit my count to about a million records and it seems to be OK.

AbeHendricks · ‎01-12-2013

Here is an actual example, and also some notes:

1. This method is WAY faster if you can use the data access cursors (new in v10.1)... Like about 20x faster!
2. As Duncan said, you may indeed run out of 32-bit memory with 7.5 million rows. I've experienced this... Remember that ints take up less RAM than floats way way less than strings.... especially long strings.
3. Unlike the example below, it would be way more compact to make just one dictionary and for each look up table, append to the attribute tuple/list that each key is pointing to... No need to have duplicate keys if only one key is required.

#Process: Build dictionaries of the basin size stats that we can then use in an update cursor - faster than join/calc
message = "Populating BASIN_SIZE_DICT..."; showPyMessage()
basinSizeDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(flowAccZoneStatTbl)]) #Int
message = "Populating SLOPE_DICT..."; showPyMessage()
slopeDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(slopePctZoneStatTbl)]) #Int
message = "Populating CURVE_PLAN3_DICT..."; showPyMessage()
curvePlan3Dict = dict([(r.VALUE, (r.MEAN)) for r in arcpy.SearchCursor(curvePlan3ZoneStatTbl)]) #Float
message = "Populating ELEV_DICT..."; showPyMessage()
elevationDict = dict([(r.VALUE, (r.MIN, r.MAX, int(r.MEAN + .5))) for r in arcpy.SearchCursor(elevationZoneStatTbl)]) #Int

#Process: Blah blah
##...State Secrets...

#Process: Update the streamFC table
message = "Updating stream table..."; showPyMessage()
updateRows = arcpy.UpdateCursor(streamFC)
for updateRow in updateRows:
    streamIdValue = updateRow.STREAM_ID
    if streamIdValue in outFlowStreamLinkList:
        updateRow.OUT_FLG = 1
    else:
        updateRow.OUT_FLG = -1
    updateRow.TILE_NO = int(areaId)
    updateRow.SUB_NET = streamNetworkIdDict[streamIdValue]
    updateRow.ACRE_MIN = basinSizeDict[streamIdValue][0] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    updateRow.ACRE_MAX = basinSizeDict[streamIdValue][1] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    updateRow.ACRE_MEAN = basinSizeDict[streamIdValue][2] * cellSize ** 2 / 43560 #SQ FEET TO ACRES
    del basinSizeDict[streamIdValue]
    updateRow.ELEV_MIN = elevationDict[streamIdValue][0]
    updateRow.ELEV_MAX = elevationDict[streamIdValue][1]
    updateRow.ELEV_MEAN = elevationDict[streamIdValue][2]
    del elevationDict[streamIdValue]
    updateRow.SLOPE_MIN = slopeDict[streamIdValue][0]
    updateRow.SLOPE_MAX = slopeDict[streamIdValue][1]
    updateRow.SLOPE_MEAN = slopeDict[streamIdValue][2]
    del slopeDict[streamIdValue]
    updateRow.PCRV3_MEAN = curvePlan3Dict[streamIdValue]
    del curvePlan3Dict[streamIdValue]    
    updateRows.updateRow(updateRow)
del updateRow, updateRows
message = "Done updating stream table!"; showPyMessage()

Chris,

Can't thank you enough for all the posts relating to cursors and domains on this forum. Finally, our organization will be able to utilize cartographic representations, and do other sorts of everyday tasks such as update feature classes that have more than 100 records.

Cheers,
abe h.

ChrisSnyder · ‎01-14-2013

...and do other sorts of everyday tasks such as update feature classes that have more than 100 records.

Ha! That's funny... You are very welcome.

PeterWilson · ‎02-08-2013

Hi Chris

Firstly thank you for all your advice and assistance, its truly appreciated. Apologies for the late reply, been pretty busy the last while. I eventually used ArcGIS 10.1 Data Access Module through Python. It was a life saver when dealing with such a large dataset. The following video was exceptionally helpful in getting started with the Data Access Module:

Data Access Module

Regards

KimOllivier · ‎02-09-2013

I have just run out of memory with too large a dictionary containing large strings (132 chars) to update a table.
The symbology classification using quantile intervals gave me an idea of the way to split using the key into equal batches of records.
Here is a more useful Python function that did the job to partition the data splendidly.
quantile function
I was able to use this to create a list of breakpoints which then limited the dictionary size and updaterow counts to 250K each.
Each step ran in a few seconds for a total time of 3 minutes for 3 million records, instead of over an hour for smaller sets.
I needed to trap missing code errors.

dBreaks = {"parcel":[3272715, 3542277, 3812689, 4079535, 4350267, 4620430, 4890728, 5160184, 6770930],
            "pother":[6845367, 6940090, 7017134, 7077541, 7135010, 7190427, 7248224, 7301079, 7359295]}
# ....
breaks = dBreaks[fcSource]
    print breaks
    for step in range(len(breaks)):
        if step == 0:
            expr = "PAR_ID < "+str(breaks[step])
        elif step == len(breaks)-1:
            expr = "PAR_ID >= "+str(breaks[step])
        else:
            expr = "PAR_ID >= "+str(breaks[step-1])+" and PAR_ID < "+str(breaks[step])
        print "step",step,expr
        e = 0
        dApp = dict([(row[0],row[1]+" "+row[2]) for row in arcpy.da.SearchCursor(wsSource+"/"+fcLabel,["par_id","legal1","legal2"],expr)])
        print "recprds",len(dApp)
        with arcpy.da.UpdateCursor(fcTarget,["PAR_ID",fldApp],expr) as cur:
            for row in cur:
                try:
                    row[1] = dApp[row[0]]
                except:
                    e+=1
                cur.updateRow(row)
        if e > 0: print "errors",e

[/URL]

This only works if the key is numeric. When the key is not numeric and there is no other field to split the source and target featureclasses I copy one of the tables into the same geodatabase so that I can use the MakeQueryTable tool and then CopyFeatures immediately to a new featureclass. (You don't seem to be able to do any further processing on this strange layer.) Further processing might be FeatureToFeature to rename fields because MakeQueryTable prefixes the table name even if QualifedFieldNames is set to False.