Modifying Permanent Sort script by Chris Snyder

08-02-2010 08:42 PM
RichardFairhurst
MVP Honored Contributor
Chris:

I am not sure how best to contact you directly, but I have been trying to modify a script you wrote and posted on the public scripts page about a year ago. Your script permanently sorts a feature class or table to a new output feature class or table.

I have been trying to modify the script to optionally add a valid additional field of type Long to the output that will store a copy of the original data source's OID field values. Your script by default drops the original OID values from the output. However, I need these OID values when I want to join the output back to the source, because the data source has no unique ID other than the OID field and I cannot alter the source schema. I have some code that basically works if I can assume that all OID fields are actually named "OBJECTID", but I do not think that is always the case.

In VBA there are methods that can query a data source to confirm it has an OID and then get the OID field's name (which tells me there is more than one way these fields can be named). Is there anything equivalent in Python? Or how would you approach checking these two aspects of a data source so that appropriate error checking routines could be added to your Python script for a subroutine like the one I want? If the new subroutine works, I would always be able to optionally include a join field in my sorted output that would work with any source that has an OID value, without ever having to modify the source schema.
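
To illustrate, here is the kind of check I am after. The describe-like stub class below is purely hypothetical; it only stands in for whatever describe object Python's geoprocessor would return, assuming it exposes something like the hasOID and OIDFieldName properties I use from VBA:

```python
# Sketch of the OID check I am after. DescribeStub is a made-up stand-in
# for a real describe result -- only the two properties matter here.

def get_oid_field_name(desc):
    """Return the OID field name, or None if the source has no OID."""
    if getattr(desc, "hasOID", False):
        return desc.OIDFieldName
    return None

class DescribeStub:
    # Fake describe result for demonstration only.
    hasOID = True
    OIDFieldName = "FID"  # e.g. shapefiles name their OID "FID", not "OBJECTID"

print(get_oid_field_name(DescribeStub()))  # FID
```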

Thanks for the script and I hope you can help me.

Rich
0 Kudos
27 Replies
ChrisSnyder
Regular Contributor III
Rich,

You are totally correct! Great observation... I thought it seemed like when I was testing this stuff yesterday it was running a bit slow when I used the ORIG_OID option... Another idea to structure it would be to append the ORIG_OID field (if the user provides it) to the inputLayerFieldList, and then have a condition in the loop that says if inputLayerField == originalOidFieldName, then attempt to write it.
RichardFairhurst
MVP Honored Contributor
I wanted to also modify the loop processing to implement a standard ArcObjects practice, but I do not know if Python supports it. In ArcObjects it is much more efficient to read and write data using a field's index number (its relative numeric position in the table) rather than its field name. The reason is that when a field name is passed to an ArcObjects method, the method runs an invisible loop through the table schema in the background to find the field's index position. Methods that take the field index directly avoid that hidden loop, and operate more like accessing an array element with an index value.

It is a standard ArcObjects practice for speeding up loops that use dual cursors to first match up the input and output field names outside of the loop, producing a key pair array of field index values. Then within the loop the key pair indexes are used as directly as possible, and methods that trigger the hidden field name/schema matching loop are avoided. Using the pre-loop matched field indexes directly in ArcObjects can speed up cursor get and set value operations by a factor that averages half the number of fields. The effect on performance becomes very pronounced in a loop.
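
In plain Python terms, the key pair matching I mean looks something like this (ordinary lists stand in for the two cursors, and all the field names and values are invented for illustration):

```python
# Illustration of pre-loop field-index matching. Plain lists of tuples
# stand in for the input and output cursors; field names are made up.

in_fields = ["OBJECTID", "NAME", "AREA", "NOTES"]
out_fields = ["NAME", "AREA"]

# Match field names ONCE, outside the loop, to get (in_idx, out_idx) pairs.
index_pairs = [(in_fields.index(f), out_fields.index(f)) for f in out_fields]

in_rows = [(1, "parcel A", 10.5, "x"), (2, "parcel B", 3.2, "y")]
out_rows = []
for in_row in in_rows:
    out_row = [None] * len(out_fields)
    # Inside the loop only integer indexes are used -- no name lookups.
    for in_idx, out_idx in index_pairs:
        out_row[out_idx] = in_row[in_idx]
    out_rows.append(tuple(out_row))

print(out_rows)  # [('parcel A', 10.5), ('parcel B', 3.2)]
```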

Is there anything equivalent in Python that will provide and use a field index value with a cursor rather than a field name? If not, I believe all loops with get and set value cursor operations that use a field name are really processing two loops (and taking an increasing hit as fields toward the end of the schema are accessed). If no similar field index capability exists for Python, that might partially explain the atrocious performance of Python on joined tables as opposed to VBA or VBScript in the ModelBuilder field calculator.

Rich
ChrisSnyder
Regular Contributor III
Wow that's good info! So I do know of a way to get the field's index via Python, but it's slow and hokey, and you have to use a featurelayer or tableview as input. For example:

gp.makefeaturelayer(fc, "fl")
print gp.describe("fl").fieldinfo.findfieldbyname("TER_ID")
# prints 6 - that is to say, TER_ID is the 7th field, 0 being the 1st

However, I don't think there is a way to retrieve a field value in a cursor using its index value. Maybe I'm wrong though... Seems like that would be a major performance enhancement like you said. I have mostly shied away from embedded cursors (the current script being an exception) since they tend to be slow.

One thing I have used a lot (and that could be useful here if the table isn't too huge and you don't care to transfer the geometry) is using a Python dictionary object to emulate a table structure in memory. Once you have loaded a table into a dictionary (using a searchcursor) it is EXTREMELY fast to access the field values. This is especially useful when you need to constantly look up lots and lots of unordered values. For example, I wrote a script that performs traces of a stream network (created using the hydrology tools). The idea is that every arc is flowing downhill, so each arc has a startnode that is upstream of its endnode. The basic approach I took was to load the OIDs and their start/end nodes into a dictionary. Then (if tracing downstream) I basically traverse the items in the dictionary, looking up the arc's end node, searching for an arc that has its startnode == the last end node, and so on. See: http://forums.esri.com/thread.asp?t=275202&f=1729&c=93#948148 for the code.
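
A stripped-down toy version of that downstream trace (the OIDs, node numbers, and dictionary layout here are made up for illustration, not taken from the linked script):

```python
# Toy downstream trace over a dictionary of {oid: (startnode, endnode)}.
# The invented network below is a simple chain: node 1 -> 2 -> 3 -> 4.
arcs = {
    10: (1, 2),
    11: (2, 3),
    12: (3, 4),
}

def trace_downstream(arcs, start_oid):
    """Follow endnode -> startnode links, returning the OIDs visited."""
    path = [start_oid]
    node = arcs[start_oid][1]          # current downstream endnode
    found = True
    while found:
        found = False
        for oid, (start, end) in arcs.items():
            if start == node:          # this arc begins where we left off
                path.append(oid)
                node = end
                found = True
                break
    return path

print(trace_downstream(arcs, 10))  # [10, 11, 12]
```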

A simple dictionary example:

lookupDict = {}
searchRows = gp.searchcursor(lookupTbl)
searchRow = searchRows.next()
while searchRow:
   lookupDict[searchRow.LOOKUPITEM] = [searchRow.FIELDVALUE1,searchRow.FIELDVALUE2]
   searchRow = searchRows.next()
del searchRow
del searchRows

updateRows = gp.updatecursor(tableToUpdateTbl)
updateRow = updateRows.next()
while updateRow:
   updateRow.MEAN = (lookupDict[updateRow.LOOKUPITEM][0] + lookupDict[updateRow.LOOKUPITEM][1]) / 2
   updateRows.UpdateRow(updateRow)
   updateRow = updateRows.next()
del updateRow
del updateRows
del lookupDict
RichardFairhurst
MVP Honored Contributor
Rich,

You are totally correct! Great observation... I thought it seemed like when I was testing this stuff yesterday it was running a bit slow when I used the ORIG_OID option... Another idea to structure it would be to append the ORIG_OID field (if the user provides it) to the inputLayerFieldList, and then have a condition in the loop that says if inputLayerField == originalOidFieldName, then attempt to write it.


I prefer my version, because it allows me to distinguish three cases:

1. No user OID field name provided.
2. User OID field provided but no input OID available
3. User OID field provided, input OID available, but write attempt fails anyway.

Case 2 might not be anticipated by the user.  Case 3 is especially likely to happen when a geodatabase is being ported to a .dbf or .shp file and field name truncations do not match the way the schema was imported into the output feature class/table.

Case 3 could also occur because the code currently does not test whether the user provided a valid field name for the output data source, which could happen if the user unknowingly uses a data source keyword, includes spaces in the name, or something else. A gp.ValidateFieldName call probably should be used in the code, and if a different name is returned by that tool, the new name should be appended to the output table and the user should be warned about the field name transformation. This report would also help flag the possible errors in field transfer that may result from truncating field names in geodatabase to .dbf and .shp file transformations. The appended user OID would still allow the user to join and manually transfer any missed field data to the output when they determine there was a failure that cannot be remedied with the current implementation.
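
To sketch the idea, here is a rough plain-Python approximation of the kind of cleanup gp.ValidateFieldName performs for a .dbf workspace. The real tool's rules are workspace-specific and more complete; this function and its name are invented for illustration only:

```python
import re

def rough_validate_field_name(name, max_len=10):
    """Rough stand-in for gp.ValidateFieldName against a .dbf workspace:
    replace invalid characters with underscores, prefix a leading digit,
    and truncate to the workspace's field name length limit."""
    valid = re.sub(r"[^A-Za-z0-9_]", "_", name)
    if valid and valid[0].isdigit():
        valid = "_" + valid
    return valid[:max_len]

checked = rough_validate_field_name("ORIG OID%VALUE")
print(checked)  # ORIG_OID_V
if checked != "ORIG OID%VALUE":
    # ...this is where the script would warn the user and use the new name
    pass
```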

I hope this helps.

Rich
ChrisSnyder
Regular Contributor III
Jinx posting. Look at my post right above yours concerning dictionaries. This is all great stuff - I'm a hack self-taught programmer, so it's great to get someone else's suggestions, tips, tricks, etc.
RichardFairhurst
MVP Honored Contributor
However, I don't think there is a way to retrieve a field value in a cursor using it's index value. Maybe I'm wrong though... Seems that would be a major performance enhancement like you said. I have mostly shied away from embedded cursors (current script being an exception) since they tend to be slow.


Your current code only creates a hit that is about twice as bad as using a single cursor since you are not really doing a record match on both sides of the cursors.  You are just reading a record on one side and writing a new record on the other.  The practice I mentioned for ArcObjects should still dramatically speed performance for a single cursor and for this kind of simple embedded cursor pair if Python had an efficient way to implement it.  However, for truly embedded cursor routines that have to look up records against each other or walk through a sorted pair of cursors (using multi-field keys especially), the primary hit is from maintaining the secondary query synchronization on unsorted records or records that are only sorted in memory.

Your tool that permanently sorts to a new table (without reordering fields on the original by using a summary/frequency) offers me the potential to solve the dual cursor synchronization issue, which is partly why I am so interested in it. Developing an efficient way to do a single pass over a pair of cursors against tables with permanently sorted records could speed up the embedded cursors significantly. By operating on all fields at once in a record with the cursor pair, I should be much more efficient than using the field calculator on a pair of joined tables one field at a time (recognizing that there are diminishing returns as I need to do this with fewer fields). The OID addition I requested means that by doing a second sort back to original OID order I could use the same technique to transfer the data back to the original data source (without having to do a memory based sort). Linking back to the original is particularly important to me where the original cannot be overwritten because it has topologies, feature linked annotation or other advanced features (or I do not control the original data but need to maintain a duplicate that can use these techniques to extend and enhance the source data with data that I maintain).
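
The single pass over a pair of permanently sorted tables is essentially a merge join. A bare-bones sketch with plain lists standing in for the two sorted cursors (the keys and values are invented):

```python
# Merge-join two record lists that are both already sorted by key.
# Plain (key, value) tuples stand in for rows from two sorted cursors.
left = [(1, "a"), (3, "b"), (5, "c")]
right = [(1, "A"), (2, "B"), (3, "C"), (5, "D")]

matches = []
i = j = 0
while i < len(left) and j < len(right):
    lkey, rkey = left[i][0], right[j][0]
    if lkey == rkey:
        # Keys line up: handle the whole matched record pair at once.
        matches.append((lkey, left[i][1], right[j][1]))
        i += 1
        j += 1
    elif lkey < rkey:
        i += 1        # advance only the side that is behind
    else:
        j += 1

print(matches)  # [(1, 'a', 'A'), (3, 'b', 'C'), (5, 'c', 'D')]
```

This sketch assumes unique keys on each side; duplicate keys would need the usual merge-join adjustments.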

The dictionary approach is valuable when the spatial information does not need to be taken into consideration. Unfortunately, the spatial data almost always plays a role in my cursor routines. In my case I may use Visual Studio to do the embedded cursor work (or I might port your Python code to Visual Studio and see if I can get a substantial performance gain using the ArcObjects cursor method I described above). I am not really comfortable porting my VBA to Visual Studio either, but if I am going to have to climb another learning curve I want others to benefit (like I have from this exchange). Anyway, thanks for pointing me in a new direction that I think really begins to tackle some of the problems I am trying to solve.


(Edit) Actually, the dictionary approach may offer some interesting possibilities for me, since I am in Transportation and most of my problems relate to traversing line networks in different ways. Linear referencing has solved some of my problems, but it would be nice to create efficient cursor routines that can quickly traverse Route IDs and measure values on joined event tables of different inventories (signals, signs, curb ramps, accidents) and perform sophisticated analysis for output to a summary table. My bosses would like to do segment analysis based on 1000 intervals or between intersections, and I don't see a way to really do it without some kind of cursor operation. Route IDs and measures provide me with a way to understand direction of travel and network traversal without using the shapes (used together with an X/Y coordinate, the analysis possibilities become very interesting). Trying to train the computer to talk with end user data and needs is where I spend a lot of my time. Anyway, ordering data seems to be the biggest factor in making these routines really perform well, so this is a subject near and dear to my heart.
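
A tiny sketch of the interval-based segment analysis idea: binning point events on a route by measure into fixed-length segments and counting events per segment. The route IDs, measures, and interval length below are all invented:

```python
# Bin point events on a route into fixed-length measure intervals and
# count events per segment. All routes and measures below are invented.
events = [
    ("RT-1", 120.0), ("RT-1", 950.5), ("RT-1", 1010.0),
    ("RT-1", 2999.9), ("RT-2", 40.0),
]
interval = 1000.0

counts = {}
for route_id, measure in events:
    segment = int(measure // interval)   # segment 0 covers measures 0-999.99...
    key = (route_id, segment)
    counts[key] = counts.get(key, 0) + 1

print(sorted(counts.items()))
# [(('RT-1', 0), 2), (('RT-1', 1), 1), (('RT-1', 2), 1), (('RT-2', 0), 1)]
```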

Rich
RichardFairhurst
MVP Honored Contributor
Jinx posting. Look at my post right above yours concerning dictionaries. This is all great stuff - I'm a hack self-taught programmer, so it's great to get someone else's suggestions, tips, tricks, etc.


Double jinx posting.  My background is the same.  I am not part of IT.  I'm just a guy who wants to get the job done and transform the way my Department sees and does things through GIS.
ChrisSnyder
Regular Contributor III
However, for truly embedded cursor routines that have to look up records against each other or walk through a sorted pair of cursors (using multi-field keys especially), the primary hit is from maintaining the secondary query synchronization on unsorted records or records that are only sorted in memory.


This is a great description of the case where the dictionary structure is so fast/useful!
ChrisSnyder
Regular Contributor III
Holy crap! Rich, you got me thinking here and check this out. I had no idea you could do this (load the geometry object into a dictionary so easily). Not sure how to use it yet, but that's so damn cool!!! Basically this way you could store an entire FC as a Python dictionary! Wow!

>>> dict = {}
>>> searchRows = gp.searchcursor(fc)
>>> searchRow = searchRows.next()
>>> dict[0] = searchRow.Shape
>>> dict
{0: <geoprocessing describe geometry object object at 0x00C1AAE8>}
>>> del searchRow
>>> del searchRows
>>> dict[0].getpart(0)
<geoprocessing array object object at 0x00C1A890>
>>> dict[0].getpart(0).next().x
820175.92491082847
>>>
ChrisSnyder
Regular Contributor III
Rich, you will like this:

import arcgisscripting
gp = arcgisscripting.create(9.3)
fc = r"D:\csny490\oesf_nso_models_20100623\landscape_model\results\results_bu.gdb\summary_results"
fieldList = gp.listfields(fc)
fieldIndex = 0
fieldNameDict = {}
for field in fieldList:
    fieldNameDict[field.name] = fieldIndex
    fieldIndex = fieldIndex + 1
fcDictionary = {}
oidFieldName = gp.describe(fc).oidfieldname
searchRows = gp.searchcursor(fc)
searchRow = searchRows.next()
while searchRow:
    oidFieldValue = searchRow.getvalue(oidFieldName)
    fcDictionary[oidFieldValue] = []
    for field in fieldList:
        fcDictionary[oidFieldValue].append(searchRow.getvalue(field.name))
    searchRow = searchRows.next()
del searchRow
del searchRows

#Now that it's in memory, fetch the habitat density field value of OID = 31 and its width...
print fcDictionary[31][fieldNameDict["HAB_DENSITY"]] + " - " + str(fcDictionary[31][fieldNameDict["Shape"]].extent.width) + " feet wide!"