Processing time with Hydrology tools in standalone Python script vs. ArcCatalog

04-11-2011 09:59 AM
ThomasBurley
New Contributor
Hi,
My question is: what might cause the ArcGIS 9.3 Hydrology Toolbox Snap Pour Point and Watershed tools to run *significantly* slower in a
Python script versus manually running the tools in ArcCatalog (both runs using the exact same input data files and specification parameters)??

I'm running this script from the Windows command line, so neither ArcMap nor ArcCatalog is open; using ArcGIS 9.3.1 on a Windows XP machine with 4 GB of RAM and
plenty of processor power (a Dell Precision machine); and all input data are local.

I have pre-processed Flow Accumulation, Flow Direction, and a Feature Class of water-quality sample sites to be used as pour points, and all
datasets are in NAD83 UTM 17N projection. The Flow Accum and Flow Dir data files have spatial extents based on Hydrologic Unit Code (HUC) eight digit boundaries.
The script iterates through each record in the sample site feature class, determines which eight digit HUC the point falls in,
selects and exports the site record as a new feature class in a scratch geodatabase, and then handles the watershed processing using that exported point feature class.

I've stripped out comments and some other calls to a "processing log timestamp" function used for reporting tool run times, just to focus here on the question
at hand. The biggest time sinks, per the processing log, are the Snap Pour Point tool and the Watershed tool:
on the order of 4+ minutes per site for Snap Pour Point (using a subset of all the sites for testing), and then 3-4 minutes per site for the Watershed tool.

However, I can take the same single point feature class and the same flow accumulation and flow direction rasters that the Python script is using,
and run the Snap Pour Point and Watershed tools manually in ArcCatalog in approximately 1 minute or less for each site (the resulting watersheds are the same as
the watersheds the script produces). Example: for one of the sites I manually ran the Snap Pour Point tool in 58 seconds via ArcCatalog, while the Python
script took 3 minutes 59 seconds for the same tool to complete. I then ran the Watershed tool manually for the same site and it
took 1 minute 1 second in ArcCatalog, while the Python script took 4 minutes 8 seconds for the tool to complete.
I'm using a 50 meter buffer for the Snap Pour Point tool in both the Python script and during my ArcCatalog manual run comparisons.

As you can see, the script is CONSIDERABLY slower. I've hit a wall with what I can figure out, so I'd much appreciate any suggestions or insight.

thanks
Tom

try:
    if hucGdb in gdbFiles:
        outScratch = scratch + '\\' + delinSiteID + 'temp'
        flowAccum = filePath + '\\' + hucGdb + '\\' + 'FA_' + huc
        flowDir = filePath + '\\' + hucGdb + '\\' + 'FD_' + huc

        # Select this site's record into a scratch feature class, temporarily
        # matching the extent to the flow accumulation raster
        tempEnvironment = gp.extent
        gp.extent = flowAccum
        clause = "[DELIN_ID] = '" + delinSiteID + "'"
        gp.Select_analysis(delinSites, outScratch, clause)
        gp.extent = tempEnvironment

        outSnapRaster = snapPoint + '\\' + 'p' + delinSiteID

        gp.SnapPourPoint_sa(outScratch, flowAccum, outSnapRaster, tolerance)

        outSnapFeature = snapPoint + '\\' + 'p' + delinSiteID + '_snap'
        gp.RasterToPoint_conversion(outSnapRaster, outSnapFeature, "VALUE")

        gp.AddField_management(outSnapFeature, "ORIGINAL_SITE_ID", "TEXT", "#", "#", "20")
        gp.CalculateField_management(outSnapFeature, "ORIGINAL_SITE_ID", '"' + realID + '"')

        gp.AddField_management(outSnapFeature, "DELIN_ID", "TEXT", "#", "#", "20")
        gp.CalculateField_management(outSnapFeature, "DELIN_ID", '"' + delinSiteID + '"')

        # Delineate the watershed with the extent matched to the flow direction raster
        outWsRaster = scratch + '\\' + delinSiteID
        tempEnvironment = gp.extent
        gp.extent = flowDir

        gp.Watershed_sa(flowDir, outSnapRaster, outWsRaster)
        gp.extent = tempEnvironment

        outWsFeature = basinOutput + '\\' + delinSiteID

        gp.RasterToPolygon_conversion(outWsRaster, outWsFeature, "SIMPLIFY")

        gp.AddField_management(outWsFeature, "DELIN_ID", "TEXT", "#", "#", "25")
        gp.CalculateField_management(outWsFeature, "DELIN_ID", '"' + delinSiteID + '"')

        gp.AddField_management(outWsFeature, "ORIGINAL_SITE_ID", "TEXT", "#", "#", "25")
        gp.CalculateField_management(outWsFeature, "ORIGINAL_SITE_ID", '"' + realID + '"')

    else:
        msg = ("Site ID " + realID + " does not have a geodatabase of "
               "data, moving on to the next site")
        print msg
        logFile.write(msg + '\n')

except:
    # Log the geoprocessor messages, then the Python-level error (needs "import sys"
    # at the top of the script)
    msg = "ARCGISSCRIPTING ERROR: " + gp.GetMessages(2)
    print msg
    logFile.write(msg + '\n')
    msg = "ERROR: " + str(sys.exc_info()[1])
    print msg
    logFile.write(msg + '\n')

row = rows.next()
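For context, the "processing log timestamp" calls stripped out above can be sketched as a small helper like the one below. This is a hypothetical stand-in, not the original function: it wraps a tool call, logs timestamped start/finish lines, and returns the elapsed seconds so slow tools (Snap Pour Point, Watershed) stand out in the log.

```python
import time
from datetime import datetime

def timed_call(log_lines, label, func, *args):
    """Run func(*args), appending timestamped START/FINISH lines to log_lines.

    Returns (result, elapsed_seconds). A hypothetical stand-in for the
    "processing log timestamp" helper mentioned in the post; in the real
    script, func would be a gp tool method such as gp.SnapPourPoint_sa.
    """
    start = time.time()
    log_lines.append("%s  START  %s" % (datetime.now().isoformat(), label))
    result = func(*args)
    elapsed = time.time() - start
    log_lines.append("%s  FINISH %s (%.1f s)" % (datetime.now().isoformat(), label, elapsed))
    return result, elapsed

# Example with a dummy "tool" standing in for a gp call:
log = []
result, secs = timed_call(log, "SnapPourPoint", lambda a, b: a + b, 2, 3)
```

Writing `log` out to the .txt log file afterwards gives the per-tool timings used in the comparison above.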
4 Replies
curtvprice
MVP Esteemed Contributor
My main question: are you sure this comparison is apples to apples? That is, have you compared the geoprocessing environments between the two runs, and are you using the tools in ArcCatalog or the (different) ArcObjects interface in the Spatial Analyst toolbar? (The environment from the toolbar is different than the GP tool environment.)

Note that a gp script running from IDLE or elsewhere outside of ArcGIS does not inherit the default ArcCatalog environment. If you want a current and scratch workspace to be set, for example, or certain toolboxes (besides the system toolboxes) loaded, you must set or add them in your script when running "standalone."

If you are using tools in ArcCatalog, and the ArcCatalog output is a folder while the Python output is a geodatabase, the ArcCatalog version may be running faster because only grids are native read/write in 9.3.x; other formats must be copied to/from GRID format before or after tool processing.

Another slowdown can be that from ArcCatalog you don't need to instantiate a geoprocessor, but that usually only affects tool startup time, not processing time.

A tip in the raster environment that always makes tools run faster: make sure the current and scratch workspaces are both set to a (local) folder. This saves a lot of extra file copying. (Raster tools create a temp file in the scratch workspace while they work ("g_g_g2" etc.) and rename it as a last step. If your scratch and current workspaces are in different places, or the current workspace or output isn't a folder, the entire tool output must be copied -- which takes a while for a large raster.)
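Concretely, in a 9.3 standalone script the pattern looks like this. Since `arcgisscripting` is only importable on a machine with ArcGIS installed, the gp calls below are shown as comments (an assumed sketch of the 9.3 pattern) around a small runnable helper that checks both workspace paths name the same folder:

```python
import os

def same_workspace(ws1, ws2):
    """True if two workspace paths refer to the same folder (after normalizing
    trailing slashes and, on Windows, case)."""
    return os.path.normcase(os.path.normpath(ws1)) == os.path.normcase(os.path.normpath(ws2))

# On a machine with ArcGIS 9.3 you would then set (assumed pattern, per the
# tip above -- both workspaces pointing at the same local folder):
#   import arcgisscripting
#   gp = arcgisscripting.create(9.3)
#   gp.Workspace = r"e:\work"         # current workspace: a local folder
#   gp.ScratchWorkspace = r"e:\work"  # scratch in the same place avoids copies
```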
ThomasBurley
New Contributor
Hi-
I know it's not instantiating the gp that's the issue - I have timestamps for when a particular tool starts and when it finishes. Watershed and Snap Pour Point are the ones with terrible performance (I've built in a lot of print statements and write timestamps to a log .txt file for post-run analysis of where the time sinks are). I'm using the ArcToolbox Spatial Analyst Hydrology tools, not the toolbar tools.

In both runs (Python script and manually in ArcCatalog) I'm outputting to and inputting from the same personal geodatabases - they are not crammed with data, so I'm not pushing a data size limit or anything. The elevation derivative rasters are in the IMAGINE image format in the pgdb's.


How can you change what you mention regarding current and scratch workspaces? I noticed in my Windows user profile Temp directory (C:\Documents and Settings\...\Local Settings\Temp) that there are some of the g_g_g2 files (or similar) in there.

Just to see what happens, I set up a new ArcToolbox toolbox and added my Python script as a script tool. The script runs based on argument input from a "configuration" text file that it reads in, so in this fashion I just double-click the script and click "OK," as there are no variables to set. The script ran in 7 minutes 32 seconds vs. 20+ minutes for my same three test sites - about one-third of the total time. It definitely shaved a bunch of time off the Snap Pour Point and Watershed tool runs (everything else was already reasonably fast, so I'm not worried about the other tools/functions). I ran the script via ArcToolbox in ArcCatalog, not ArcMap.

SO....it seems that there is something going on (or perhaps in this case not going on) when you run a python script outside of ArcGIS (via command line or an IDE) that incurs considerable overhead on raster-calculation tools.

I don't have any special raster settings in the ArcToolBox environments. As far as I know, these are the defaults (I've never messed with them - see attached screenshot). How would these (or a lack of these defaults) impact such tools - is that the issue? If so, can they be set in standalone python code??

thanks- I think this is getting closer

Tom
curtvprice
MVP Esteemed Contributor
In both runs (Python script and manually in ArcCatalog) I'm outputting to and inputting from the same personal geodatabases - they are not crammed with data, so I'm not pushing a data size limit or anything. The elevation derivative rasters are in the IMAGINE image format in the pgdb's.

As I said, for best performance overall you should read and write directly to grids, especially with tools like these that may be working with large rasters. You do this by writing to a folder workspace with no file extension in the output file name.

How can you change what you mention regarding current and scratch workspaces? I noticed in my Windows user profile Temp directory (C:\Documents and Settings\...\Local Settings\Temp) that there are some of the g_g_g2 files (or similar) in there.

The current and scratch workspaces are part of the geoprocessing environment. In Python you set them through geoprocessor properties. The main settings I was concerned about performance-wise were the current and scratch workspaces, which in the interface are under General Settings:
gp.Workspace = r"e:\work"
gp.ScratchWorkspace = r"e:\work"

Or use a configuration file of course (saved from a desktop session using Save Environment) and load it with gp.LoadSettings(). Those XML config files are a really good way to make sure your interactive and outside-ArcGIS script processing are using the same environment.
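One quick way to see what a saved settings file actually captured is to inspect the XML yourself before handing it to gp.LoadSettings(). The element and attribute names below are illustrative only - the real schema written by Save Environment may differ - the point is just that the file is plain XML you can read:

```python
import xml.etree.ElementTree as ET

# Illustrative settings file; the real Save Environment schema may differ.
SETTINGS_XML = """<GPEnvironments>
  <Environment name="workspace">e:\\work</Environment>
  <Environment name="scratchWorkspace">e:\\work</Environment>
  <Environment name="extent">DEFAULT</Environment>
</GPEnvironments>"""

def read_settings(xml_text):
    """Return a {environment name: value} dict from a settings XML string."""
    root = ET.fromstring(xml_text)
    return dict((env.get("name"), env.text) for env in root.findall("Environment"))

settings = read_settings(SETTINGS_XML)
```

Comparing such a dump from your desktop session against what your standalone script sets is a straightforward way to find environment mismatches.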

Just to see what happens, I set up a new ArcToolbox Toolbox and added my python script as a script that I could run. I have the script running based on argument input from a "configuration" text file that it reads in, so in this fashion I just double click on the script and click "OK" as there are no variables to set. The script ran in 7 minutes 32 seconds vs. 20+ minutes for my same three test sites, one-third of the total time. It definitely shaved off a bunch of time on the Snap Pour Point and Watershed tool runs (everything else was already reasonably fast so not worried about other tools/functions). I ran the script via ArcToolBox in ArcCatalog, not ArcMap.

When you add a script tool to a toolbox and run it, it inherits the GP environment from the desktop session, including its gp object (if you've checked the box to run in-process). So it's generally faster than running "standalone."

By the way, a really good way to time the tools themselves is to print the tool messages, which include timings:
print str(gp.GetMessages(0)) # for IDLE or PythonWin window
gp.AddMessage(gp.GetMessages(0)) # for ArcCatalog/Map tool messages or Windows command window
ThomasBurley
New Contributor
It appears the gp.LoadSettings method does not work - I saved the ArcCatalog ArcToolbox environment settings as an XML file, and when I try to load the settings in the script, it says the object cannot be found, even though I triple-verified that I had the path and file name correct. I found this forum post that discusses this in the context of ArcObjects, but it seems the Python approach does the same thing: http://forums.arcgis.com/threads/7204-igeoprocessor.savesettings-and-igeoprocessor.loadsettings-not-...

However, I did set gp.ScratchWorkspace this time, and that seems to have made the difference. I never would have thought it would impact performance that much; I simply specified the same scratch workspace I have set in the ArcToolbox environment. That's the only change I made in my testing configuration, and it seems to have done it. The script ran from the command line (using all the same input and output file settings and files) in about 6 minutes 49 seconds - actually about 40 seconds quicker than the ArcCatalog run I did. I verified the watersheds and all file output, and everything looks spot on.

thank you, sir!

Tom