I'm busy rewriting my hydrological model Python scripts into modular functions, and I've found that my original script runs twice as fast as my updated one. The only reason I can find is that the original script's outputs are kept in memory before being passed on to the next function, while my new script writes each output to disk and reads it back. I might be incorrect in my understanding. Can anyone give me guidance on improving my new script, in both Pythonic structure and performance?
Original Python Script
'''
Created on May 20, 2015
@author: PeterW
'''
# import system modules and site packages
import os
import time
import arcpy
import ArcHydroTools

# check out Spatial Analyst Extension
arcpy.CheckOutExtension("Spatial")

# set environment settings
arcpy.env.overwriteOutput = True

# set input and output arguments
raw = r"E:\Python\Temp\Model04\DEM\raw"
rasWs = r"E:\Python\Temp\Model04\Layers04"
outWs = r"E:\Python\Temp\Model04\Model04.gdb"

# Processing time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int(sec_elapsed % (60 * 60) / 60)
    s = sec_elapsed % 60
    return "{}h:{:>02}m:{:>05.2f}s".format(h, m, s)

start_time1 = time.time()

# archydro variables
fill_sinks = os.path.join(rasWs, "fil")
flow_dir = os.path.join(rasWs, "fdr")
flow_acc = os.path.join(rasWs, "fac")
streams = os.path.join(rasWs, "str")
stream_seg = os.path.join(rasWs, "strlnk")
catchment_grid = os.path.join(rasWs, "cat")
catchment_poly = os.path.join(outWs, "Layers", "Catchment")
drainage_line = os.path.join(outWs, "Layers", "DrainageLine")
adj_catch = os.path.join(outWs, "Layers", "AdjointCatchment")

try:
    # calculate the fill sinks
    arcpy.AddMessage("Processing Fill Sinks")
    ArcHydroTools.FillSinks(raw, fill_sinks)
    # calculate the flow direction
    arcpy.AddMessage("Processing Flow Direction")
    ArcHydroTools.FlowDirection(fill_sinks, flow_dir)
    # calculate the flow accumulation
    arcpy.AddMessage("Processing Flow Accumulation")
    ArcHydroTools.FlowAccumulation(flow_dir, flow_acc)
    # calculate the maximum flow accumulation
    arcpy.AddMessage("Processing Flow Accumulation Maximum")
    maxcellsResult = arcpy.GetRasterProperties_management(flow_acc, "MAXIMUM")
    maxcells = maxcellsResult.getOutput(0)
    print(maxcells)
    # calculate the stream threshold number of cells
    arcpy.AddMessage("Processing Stream Threshold")
    stream_threshold_numcells = (int(maxcells) * 0.125 / 100)
    print(stream_threshold_numcells)
    # calculate the stream definition
    arcpy.AddMessage("Processing Stream Definition")
    ArcHydroTools.StreamDefinition(flow_acc, stream_threshold_numcells, streams)
    # calculate the stream segmentation
    arcpy.AddMessage("Processing Stream Segmentation")
    ArcHydroTools.StreamSegmentation(streams, flow_dir, stream_seg)
    # calculate the catchment grid delineation
    arcpy.AddMessage("Processing Catchment Grid Delineation")
    ArcHydroTools.CatchmentGridDelineation(flow_dir, stream_seg, catchment_grid)
    # calculate the catchment polygons from the catchment grid
    arcpy.AddMessage("Processing Catchment Polygons")
    ArcHydroTools.CatchmentPolyProcessing(catchment_grid, catchment_poly)
    # calculate the drainage lines from the stream segmentation grid
    arcpy.AddMessage("Processing DrainageLines")
    ArcHydroTools.DrainageLineProcessing(stream_seg, flow_dir, drainage_line)
    # calculate the adjoint catchment polygons
    arcpy.AddMessage("Processing Adjoint Catchments")
    ArcHydroTools.AdjointCatchment(drainage_line, catchment_poly, adj_catch)
    arcpy.AddMessage("Completed Processing archydro Main Model")
except:
    print(arcpy.GetMessages(2))

# Determine the time taken to process hydrological characteristics
end_time1 = time.time()
print("It took {} to process hydrological characteristics".format(hms_string(end_time1 - start_time1)))
arcpy.CheckInExtension("Spatial")
New Python Script
'''
Created on Feb 24, 2016
@author: PeterW
'''
# import system modules and site packages
import time
from pathlib import Path
import arcpy
import ArcHydroTools

# check out extension
arcpy.CheckOutExtension("Spatial")

# set environment settings
arcpy.env.overwriteOutput = True

# set input and output arguments
dem = r"E:\Python\Temp\Model04\DEM\raw"
raster_workspace = r"E:\Python\Temp\Model04\Layers04"
fgdb = Path(r"E:\Python\Temp\Model04\Model04.gdb")

# Processing time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int(sec_elapsed % (60 * 60) / 60)
    s = sec_elapsed % 60
    return "{}h:{:>02}m:{:>05.2f}s".format(h, m, s)

start_time1 = time.time()

# generate fill sinks grid
def fill_sinks(dem, raster_workspace):
    fill_sinks = "{0}\\{1}".format(raster_workspace, "fil")
    arcpy.AddMessage("Processing Fill Sinks")
    ArcHydroTools.FillSinks(dem, fill_sinks)

fill_sinks(dem, raster_workspace)

# generate flow direction grid
def flow_direction(raster_workspace):
    fill_sinks = "{0}\\{1}".format(raster_workspace, "fil")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    arcpy.AddMessage("Processing Flow Direction")
    ArcHydroTools.FlowDirection(fill_sinks, flow_direction)

flow_direction(raster_workspace)

# generate flow accumulation grid
def flow_accumulation(raster_workspace):
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    arcpy.AddMessage("Processing Flow Accumulation")
    ArcHydroTools.FlowAccumulation(flow_direction, flow_accumulation)

flow_accumulation(raster_workspace)

# calculate stream threshold based on 0.5% of maximum flow accumulation
def stream_threshold(raster_workspace):
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    arcpy.AddMessage("Processing Flow Stream Threshold")
    maxcellsResult = arcpy.GetRasterProperties_management(flow_accumulation, "MAXIMUM")
    maxcells = maxcellsResult.getOutput(0)
    arcpy.AddMessage("{} Maximum Cells".format(maxcells))
    stream_threshold_numcells = (int(maxcells) * 0.5 / 100)
    arcpy.AddMessage("{} Stream Threshold".format(stream_threshold_numcells))
    return stream_threshold_numcells

numcells = stream_threshold(raster_workspace)

# generate the stream definition grid
def stream_definition(raster_workspace):
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    stream = "{0}\\{1}".format(raster_workspace, "str")
    arcpy.AddMessage("Processing Stream Definition")
    ArcHydroTools.StreamDefinition(flow_accumulation, numcells, stream)

stream_definition(raster_workspace)

# generate the stream segmentation grid
def stream_segmentation(raster_workspace):
    stream = "{0}\\{1}".format(raster_workspace, "str")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fac")
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    arcpy.AddMessage("Processing Stream Segmentation")
    ArcHydroTools.StreamSegmentation(stream, flow_direction, stream_link)

stream_segmentation(raster_workspace)

# calculate the catchment grid delineation
def catchment_grid(raster_workspace):
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    catchment_grid = "{0}\\{1}".format(raster_workspace, "cat")
    arcpy.AddMessage("Processing Catchment Grid Delineation")
    ArcHydroTools.CatchmentGridDelineation(flow_direction, stream_link, catchment_grid)

catchment_grid(raster_workspace)

# calculate the catchment polygons from the catchment grid
def catchment_polygon(raster_workspace, fgdb):
    catchment_grid = "{0}\\{1}".format(raster_workspace, "cat")
    catchment_polygon = "{0}\\{1}".format(Path(fgdb, "Layers"), "Catchment")
    arcpy.AddMessage("Processing Catchment Polygon")
    ArcHydroTools.CatchmentPolyProcessing(catchment_grid, catchment_polygon)

catchment_polygon(raster_workspace, fgdb)

# calculate the drainage lines from the stream segmentation grid
def drainage_line(raster_workspace, fgdb):
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    drainage_line = "{0}\\{1}".format(Path(fgdb, "Layers"), "DrainageLine")
    arcpy.AddMessage("Processing DrainageLine")
    ArcHydroTools.DrainageLineProcessing(stream_link, flow_direction, drainage_line)

drainage_line(raster_workspace, fgdb)

# calculate the adjoint catchment polygons
def adjoint_catchment(fgdb):
    drainage_line = "{0}\\{1}".format(Path(fgdb, "Layers"), "DrainageLine")
    catchment_polygon = "{0}\\{1}".format(Path(fgdb, "Layers"), "Catchment")
    adjoint_catchment = "{0}\\{1}".format(Path(fgdb, "Layers"), "AdjointCatchment")
    arcpy.AddMessage("Processing Adjoint Catchment")
    ArcHydroTools.AdjointCatchment(drainage_line, catchment_polygon, adjoint_catchment)

adjoint_catchment(fgdb)

arcpy.CheckInExtension("Spatial")

# Determine the time taken to process hydrological characteristics
end_time1 = time.time()
print("It took {} to process hydrological characteristics".format(hms_string(end_time1 - start_time1)))
These won't improve performance*, but they will make the script more readable, with less code to manage and less code to mess up.
*It may improve insignificantly, by nanoseconds, from not having to re-create the same variables over and over again!
Hi Jason
Thanks for your feedback; it's truly appreciated, as I'm still a novice when it comes to programming. With regard to your suggestions, would you mind giving me an example to explain your first suggestion:
I only realized that I was defining my variables multiple times once you pointed it out. Thanks for catching that.
Lastly, could you give me some advice on how I could create a function that builds my input and output arguments based on the raster workspace and fgdb?
# imports at top
import os
import arcpy
# etc. . .

# define methods here
def fill_sinks():
    pass  # fill sink code

def flow_direction():
    pass  # flow direction code

def flow_accumulation():
    pass  # flow accumulation code

# etc...

# write your statements inline here, and better yet test for "__main__" first
# to only execute this code if you are executing this file
if __name__ == '__main__':
    fill_sinks()
    flow_direction()
    flow_accumulation()
    # etc . . .
Hi Jason
After fixing the incorrect assignment at line 98, where fac (Flow Accumulation) was used in place of the Flow Direction grid, my script completes in the same amount of time as my previous script. I'm going to look into developing a function that allocates the variables automatically, since the naming is standardised. The only inputs required from the user are the DEM, the raster workspace, and the file geodatabase.
Since this isn't a class, only a script, you really don't need to put this in a method. How you did it in your original script would work just fine:
# archydro variables
fill_sinks = os.path.join(rasWs, "fil")
flow_dir = os.path.join(rasWs, "fdr")
flow_acc = os.path.join(rasWs, "fac")
streams = os.path.join(rasWs, "str")
stream_seg = os.path.join(rasWs, "strlnk")
catchment_grid = os.path.join(rasWs, "cat")
catchment_poly = os.path.join(outWs, "Layers", "Catchment")
drainage_line = os.path.join(outWs, "Layers", "DrainageLine")
adj_catch = os.path.join(outWs, "Layers", "AdjointCatchment")
If you prefer Path, just replace os.path.join(x,y,z) with str(Path(x,y,z)).
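If you do want all the paths built from just the workspace and fgdb arguments, one way (a sketch, not from the thread; `build_paths` is a hypothetical helper name, though the grid names "fil", "fdr", etc. follow the scripts above) is a small function that returns them all in a dict, so each path is defined exactly once:

```python
import os


def build_paths(raster_workspace, fgdb):
    """Return a dict mapping short names to the standard Arc Hydro paths.

    Hypothetical helper; the keys and grid/feature-class names mirror the
    variables in the original script above.
    """
    layers = os.path.join(fgdb, "Layers")
    return {
        "fill_sinks": os.path.join(raster_workspace, "fil"),
        "flow_dir": os.path.join(raster_workspace, "fdr"),
        "flow_acc": os.path.join(raster_workspace, "fac"),
        "streams": os.path.join(raster_workspace, "str"),
        "stream_seg": os.path.join(raster_workspace, "strlnk"),
        "catchment_grid": os.path.join(raster_workspace, "cat"),
        "catchment_poly": os.path.join(layers, "Catchment"),
        "drainage_line": os.path.join(layers, "DrainageLine"),
        "adj_catch": os.path.join(layers, "AdjointCatchment"),
    }


paths = build_paths(r"E:\Python\Temp\Model04\Layers04",
                    r"E:\Python\Temp\Model04\Model04.gdb")
```

Each processing function could then accept this dict (or individual entries from it) instead of rebuilding the same strings internally.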
Honestly, for what you are doing, I think you might have over-engineered your solution on the 2nd script. The 1st one didn't look too bad. For performance, make sure everything is working in memory like other people have suggested. Only output it to disk when you are through processing it.
Hi Peter-
Your observations about read/write speeds are really important for effective geoprocessing. Since you're running in a stand-alone script, you can use the "IN_MEMORY" workspace to speed up many transactions. I'll let you read up on it, but essentially you create a virtual GDB called "IN_MEMORY" and set your environment workspace to reference it. For vector operations you can see huge improvements, and for rasters as well, if you have the memory to support it.
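arcpy and its "IN_MEMORY" workspace can't be demonstrated outside ArcGIS, but the underlying memory-vs-disk trade-off can be sketched with `sqlite3` from the standard library as a stand-in: the two connections below run identical operations, and the only difference is whether the pages live in RAM or on disk.

```python
import os
import sqlite3
import tempfile


def populate(conn, rows=10000):
    """Create a table and insert rows on the given connection."""
    conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i, "x" * 20) for i in range(rows)))
    conn.commit()


# In-memory database: reads and writes stay in RAM,
# analogous to arcpy's "IN_MEMORY" workspace
mem = sqlite3.connect(":memory:")
populate(mem)

# On-disk database: every commit has to hit the file system
path = os.path.join(tempfile.mkdtemp(), "demo.db")
disk = sqlite3.connect(path)
populate(disk)

# Both hold identical data; only the storage medium differs
count_mem = mem.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_disk = disk.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Wrapping each `populate` call with a timer will show the disk connection paying the I/O cost that the in-memory one avoids, which is the same effect you see when intermediate geoprocessing outputs stay in memory.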
More generally, I'd look at the read/write configuration on your workstation. Reading and writing to a server will be slow; reading and writing to a local SSD will be fast. You will also see improvements if you can separate read and write media, e.g. read from one SSD, write to another. If you have lots of RAM, consider defining a RAM disk at the OS level. This is a virtual hard drive that occupies RAM, and it will be very, very fast, but with much less room (typically ~4 GB for the free RAM-disk tools).
As mentioned above, there are profiling tools in Python to diagnose the relative cost of different implementations. You can see dramatic changes in Python processing time by switching from C-based/C-optimized intrinsic functions to pure-Python functions. Again, you can read up on that (especially with respect to iterators and memory models for hash tables).
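That C-builtin vs pure-Python gap is easy to measure yourself with `timeit` from the standard library. This sketch (my own example, not from the thread) times the C-implemented `sum()` against an equivalent loop run entirely in the interpreter:

```python
import timeit


def py_sum(values):
    """Pure-Python re-implementation of sum() for comparison."""
    total = 0
    i = 0
    while i < len(values):
        total += values[i]
        i += 1
    return total


data = list(range(10000))

# Time 200 repetitions of each; sum() loops in C, py_sum() in the interpreter
t_builtin = timeit.timeit(lambda: sum(data), number=200)
t_python = timeit.timeit(lambda: py_sum(data), number=200)

print("builtin sum: {:.4f}s  pure-Python loop: {:.4f}s".format(t_builtin, t_python))
```

For profiling a whole script rather than one function, `cProfile` (also in the standard library) gives a per-call breakdown, which is handy for spotting which geoprocessing step dominates the runtime.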
Have fun!
Daryl
I'll second Daryl's comments and add that 'IN_MEMORY' can also vastly speed up table operations such as joins, especially when both datasets are 'IN_MEMORY'. The improvements I have documented depend on the tool, but 2-20x improvements are quite common. However, for table manipulation it's still not nearly as fast as storing data in a true SQL database and then doing your SQL operations there. File servers are notoriously slow but I haven't tried splitting my reads/writes between SSDs or using RAMdisk. Thanks for the ideas!
Brian