I'm busy rewriting my hydrological model Python scripts into modular functions, and I've found that my original script runs twice as fast as my updated one. The only reason I can find is that the original script's outputs are kept in memory before being passed on to the next function, while my new script writes each output to disk and reads it back. I might be incorrect in my understanding. Can anyone give me guidance on improving my new script, in both Pythonic structure and performance?
Original Python Script
'''
Created on May 20, 2015
@author: PeterW
'''
# import system modules and site packages
import os
import time
import arcpy
import ArcHydroTools

# check out Spatial Analyst Extension
arcpy.CheckOutExtension("Spatial")

# set environment settings
arcpy.env.overwriteOutput = True

# set input and output arguments
raw = r"E:\Python\Temp\Model04\DEM\raw"
rasWs = r"E:\Python\Temp\Model04\Layers04"
outWs = r"E:\Python\Temp\Model04\Model04.gdb"

# Processing time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int(sec_elapsed % (60 * 60) / 60)
    s = sec_elapsed % 60
    return "{}h:{:>02}m:{:>05.2f}s".format(h, m, s)

start_time1 = time.time()

# archydro variables
fill_sinks = os.path.join(rasWs, "fil")
flow_dir = os.path.join(rasWs, "fdr")
flow_acc = os.path.join(rasWs, "fac")
streams = os.path.join(rasWs, "str")
stream_seg = os.path.join(rasWs, "strlnk")
catchment_grid = os.path.join(rasWs, "cat")
catchment_poly = os.path.join(outWs, "Layers", "Catchment")
drainage_line = os.path.join(outWs, "Layers", "DrainageLine")
adj_catch = os.path.join(outWs, "Layers", "AdjointCatchment")

try:
    # calculate the fill sinks
    arcpy.AddMessage("Processing Fill Sinks")
    ArcHydroTools.FillSinks(raw, fill_sinks)
    # calculate the flow direction
    arcpy.AddMessage("Processing Flow Direction")
    ArcHydroTools.FlowDirection(fill_sinks, flow_dir)
    # calculate the flow accumulation
    arcpy.AddMessage("Processing Flow Accumulation")
    ArcHydroTools.FlowAccumulation(flow_dir, flow_acc)
    # calculate the maximum flow accumulation
    arcpy.AddMessage("Processing Flow Accumulation Maximum")
    maxcellsResult = arcpy.GetRasterProperties_management(flow_acc, "MAXIMUM")
    maxcells = maxcellsResult.getOutput(0)
    print(maxcells)
    # calculate the stream threshold number of cells
    arcpy.AddMessage("Processing Stream Threshold")
    stream_threshold_numcells = (int(maxcells) * 0.125 / 100)
    print(stream_threshold_numcells)
    # calculate the stream definition
    arcpy.AddMessage("Processing Stream Definition")
    ArcHydroTools.StreamDefinition(flow_acc, stream_threshold_numcells, streams)
    # calculate the stream segmentation
    arcpy.AddMessage("Processing Stream Segmentation")
    ArcHydroTools.StreamSegmentation(streams, flow_dir, stream_seg)
    # calculate the catchment grid delineation
    arcpy.AddMessage("Processing Catchment Grid Delineation")
    ArcHydroTools.CatchmentGridDelineation(flow_dir, stream_seg, catchment_grid)
    # calculate the catchment polygons from the catchment grid
    arcpy.AddMessage("Processing Catchment Polygons")
    ArcHydroTools.CatchmentPolyProcessing(catchment_grid, catchment_poly)
    # calculate the drainage lines from the stream segmentation grid
    arcpy.AddMessage("Processing DrainageLines")
    ArcHydroTools.DrainageLineProcessing(stream_seg, flow_dir, drainage_line)
    # calculate the adjoint catchment polygons
    arcpy.AddMessage("Processing Adjoint Catchments")
    ArcHydroTools.AdjointCatchment(drainage_line, catchment_poly, adj_catch)
    arcpy.AddMessage("Completed Processing archydro Main Model")
except:
    print(arcpy.GetMessages(2))

# Determine the time taken to process hydrological characteristics
end_time1 = time.time()
print("It took {} to process hydrological characteristics".format(hms_string(end_time1 - start_time1)))
arcpy.CheckInExtension("Spatial")
New Python Script
'''
Created on Feb 24, 2016
@author: PeterW
'''
# import system modules and site packages
import time
from pathlib import Path
import arcpy
import ArcHydroTools

# check out extension
arcpy.CheckOutExtension("Spatial")

# set environment settings
arcpy.env.overwriteOutput = True

# set input and output arguments
dem = r"E:\Python\Temp\Model04\DEM\raw"
raster_workspace = r"E:\Python\Temp\Model04\Layers04"
fgdb = Path(r"E:\Python\Temp\Model04\Model04.gdb")

# Processing time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int(sec_elapsed % (60 * 60) / 60)
    s = sec_elapsed % 60
    return "{}h:{:>02}m:{:>05.2f}s".format(h, m, s)

start_time1 = time.time()

# generate fill sinks grid
def fill_sinks(dem, raster_workspace):
    fill_sinks = "{0}\\{1}".format(raster_workspace, "fil")
    arcpy.AddMessage("Processing Fill Sinks")
    ArcHydroTools.FillSinks(dem, fill_sinks)

fill_sinks(dem, raster_workspace)

# generate flow direction grid
def flow_direction(raster_workspace):
    fill_sinks = "{0}\\{1}".format(raster_workspace, "fil")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    arcpy.AddMessage("Processing Flow Direction")
    ArcHydroTools.FlowDirection(fill_sinks, flow_direction)

flow_direction(raster_workspace)

# generate flow accumulation grid
def flow_accumulation(raster_workspace):
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    arcpy.AddMessage("Processing Flow Accumulation")
    ArcHydroTools.FlowAccumulation(flow_direction, flow_accumulation)

flow_accumulation(raster_workspace)

# calculate stream threshold based on 0.5% of maximum flow accumulation
def stream_threshold(raster_workspace):
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    arcpy.AddMessage("Processing Flow Stream Threshold")
    maxcellsResult = arcpy.GetRasterProperties_management(flow_accumulation, "MAXIMUM")
    maxcells = maxcellsResult.getOutput(0)
    arcpy.AddMessage("{} Maximum Cells".format(maxcells))
    stream_threshold_numcells = (int(maxcells) * 0.5 / 100)
    arcpy.AddMessage("{} Stream Threshold".format(stream_threshold_numcells))
    return stream_threshold_numcells

numcells = stream_threshold(raster_workspace)

# generate the stream definition grid
def stream_definition(raster_workspace):
    flow_accumulation = "{0}\\{1}".format(raster_workspace, "fac")
    stream = "{0}\\{1}".format(raster_workspace, "str")
    arcpy.AddMessage("Processing Stream Definition")
    ArcHydroTools.StreamDefinition(flow_accumulation, numcells, stream)

stream_definition(raster_workspace)

# generate the stream segmentation grid
def stream_segmentation(raster_workspace):
    stream = "{0}\\{1}".format(raster_workspace, "str")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fac")
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    arcpy.AddMessage("Processing Stream Segmentation")
    ArcHydroTools.StreamSegmentation(stream, flow_direction, stream_link)

stream_segmentation(raster_workspace)

# calculate the catchment grid delineation
def catchment_grid(raster_workspace):
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    catchment_grid = "{0}\\{1}".format(raster_workspace, "cat")
    arcpy.AddMessage("Processing Catchment Grid Delineation")
    ArcHydroTools.CatchmentGridDelineation(flow_direction, stream_link, catchment_grid)

catchment_grid(raster_workspace)

# calculate the catchment polygons from the catchment grid
def catchment_polygon(raster_workspace, fgdb):
    catchment_grid = "{0}\\{1}".format(raster_workspace, "cat")
    catchment_polygon = "{0}\\{1}".format(Path(fgdb, "Layers"), "Catchment")
    arcpy.AddMessage("Processing Catchment Polygon")
    ArcHydroTools.CatchmentPolyProcessing(catchment_grid, catchment_polygon)

catchment_polygon(raster_workspace, fgdb)

# calculate the drainage lines from the stream segmentation grid
def drainage_line(raster_workspace, fgdb):
    stream_link = "{0}\\{1}".format(raster_workspace, "strlnk")
    flow_direction = "{0}\\{1}".format(raster_workspace, "fdr")
    drainage_line = "{0}\\{1}".format(Path(fgdb, "Layers"), "DrainageLine")
    arcpy.AddMessage("Processing DrainageLine")
    ArcHydroTools.DrainageLineProcessing(stream_link, flow_direction, drainage_line)

drainage_line(raster_workspace, fgdb)

# calculate the adjoint catchment polygons
def adjoint_catchment(fgdb):
    drainage_line = "{0}\\{1}".format(Path(fgdb, "Layers"), "DrainageLine")
    catchment_polygon = "{0}\\{1}".format(Path(fgdb, "Layers"), "Catchment")
    adjoint_catchment = "{0}\\{1}".format(Path(fgdb, "Layers"), "AdjointCatchment")
    arcpy.AddMessage("Processing Adjoint Catchment")
    ArcHydroTools.AdjointCatchment(drainage_line, catchment_polygon, adjoint_catchment)

adjoint_catchment(fgdb)

arcpy.CheckInExtension("Spatial")

# Determine the time taken to process hydrological characteristics
end_time1 = time.time()
print("It took {} to process hydrological characteristics".format(hms_string(end_time1 - start_time1)))
These won't improve performance*, but they will make the script more readable, with less code to manage and less code to mess up.
*It may improve insignificantly, by nanoseconds, from not having to re-create the same variables over and over again!
Hi Jason
Thanks for your feedback; it's truly appreciated, as I'm still a novice when it comes to programming. With regard to your suggestions, would you mind giving me an example to explain your first suggestion:
I only realized that I was defining my variables multiple times once you pointed it out. Thanks for catching that.
Lastly, could you give me some advice on how I could create a function that builds my input and output arguments based on the raster workspace and fgdb?
# imports at top
import os
import arcpy
# etc. . .

# define methods here
def fill_sinks():
    pass  # fill sink code

def flow_direction():
    pass  # flow direction code

def flow_accumulation():
    pass  # flow accumulation code

# etc...

# write your statements inline here, and better yet test for "__main__" first
# to only execute this code if you are executing this file
if __name__ == '__main__':
    fill_sinks()
    flow_direction()
    flow_accumulation()
    # etc . . .
Hi Jason
After fixing the incorrect assignment at line 98, where fac (Flow Accumulation) was used in place of the Flow Direction grid, my script completes in the same amount of time as my previous script. I'm going to look into developing a function that allocates the variables automatically, since the naming is standardised. The only inputs required from the user are the DEM, the raster workspace, and the file geodatabase.
Since this isn't a class, only a script, you really don't need to put this in a method. How you did it in your original script would work just fine:
# archydro variables
fill_sinks = os.path.join(rasWs, "fil")
flow_dir = os.path.join(rasWs, "fdr")
flow_acc = os.path.join(rasWs, "fac")
streams = os.path.join(rasWs, "str")
stream_seg = os.path.join(rasWs, "strlnk")
catchment_grid = os.path.join(rasWs, "cat")
catchment_poly = os.path.join(outWs, "Layers", "Catchment")
drainage_line = os.path.join(outWs, "Layers", "DrainageLine")
adj_catch = os.path.join(outWs, "Layers", "AdjointCatchment")
If you prefer Path, just replace os.path.join(x,y,z) with str(Path(x,y,z)).
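If you do want all the paths built from just the workspace and fgdb arguments, one way (a sketch, not from the thread; `build_paths` is a hypothetical helper name, though the grid names "fil", "fdr", etc. follow the scripts above) is a small function that returns them all in a dict, so each path is defined exactly once:

```python
import os


def build_paths(raster_workspace, fgdb):
    """Return a dict mapping short names to the standard Arc Hydro paths.

    Hypothetical helper; the keys and grid/feature-class names mirror the
    variables in the original script above.
    """
    layers = os.path.join(fgdb, "Layers")
    return {
        "fill_sinks": os.path.join(raster_workspace, "fil"),
        "flow_dir": os.path.join(raster_workspace, "fdr"),
        "flow_acc": os.path.join(raster_workspace, "fac"),
        "streams": os.path.join(raster_workspace, "str"),
        "stream_seg": os.path.join(raster_workspace, "strlnk"),
        "catchment_grid": os.path.join(raster_workspace, "cat"),
        "catchment_poly": os.path.join(layers, "Catchment"),
        "drainage_line": os.path.join(layers, "DrainageLine"),
        "adj_catch": os.path.join(layers, "AdjointCatchment"),
    }


paths = build_paths(r"E:\Python\Temp\Model04\Layers04",
                    r"E:\Python\Temp\Model04\Model04.gdb")
```

Each processing function could then accept this dict (or individual entries from it) instead of rebuilding the same strings internally.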
Honestly, for what you are doing, I think you might have over-engineered your solution on the 2nd script. The 1st one didn't look too bad. For performance, make sure everything is working in memory like other people have suggested. Only output it to disk when you are through processing it.
Hi Peter-
Your observations about read/write speeds are really important for effective geoprocessing. Since you're running in a stand-alone script, you can use the "IN_MEMORY" workspace to speed up many transactions. I'll let you read up on it, but essentially you create a virtual GDB called "IN_MEMORY" and set your environment workspace to reference it. For vector operations you can see huge improvements, and for rasters as well, if you have the memory to support it.
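arcpy and its "IN_MEMORY" workspace can't be demonstrated outside ArcGIS, but the underlying memory-vs-disk trade-off can be sketched with `sqlite3` from the standard library as a stand-in: the two connections below run identical operations, and the only difference is whether the pages live in RAM or on disk.

```python
import os
import sqlite3
import tempfile


def populate(conn, rows=10000):
    """Create a table and insert rows on the given connection."""
    conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)",
                     ((i, "x" * 20) for i in range(rows)))
    conn.commit()


# In-memory database: reads and writes stay in RAM,
# analogous to arcpy's "IN_MEMORY" workspace
mem = sqlite3.connect(":memory:")
populate(mem)

# On-disk database: every commit has to hit the file system
path = os.path.join(tempfile.mkdtemp(), "demo.db")
disk = sqlite3.connect(path)
populate(disk)

# Both hold identical data; only the storage medium differs
count_mem = mem.execute("SELECT COUNT(*) FROM t").fetchone()[0]
count_disk = disk.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Wrapping each `populate` call with a timer will show the disk connection paying the I/O cost that the in-memory one avoids, which is the same effect you see when intermediate geoprocessing outputs stay in memory.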
More generally, I'd look at the read/write configuration on your workstation. Reading and writing to a server will be slow; reading and writing to a local SSD will be fast. You will also see improvements if you can separate read and write media, e.g. read from one SSD, write to another. If you have lots of RAM, consider defining a RAM disk at the OS level. This is a virtual hard drive that occupies RAM, and it will be very, very fast, but with much less room (typically ~4 GB for the free RAM-disk tools).
As mentioned above, there are profiling tools in Python to diagnose the relative cost of different implementations. You can see dramatic changes in Python processing time by switching from C-based/C-optimized intrinsic functions to pure-Python functions. Again, you can read up on that (especially with respect to iterators and memory models for hash tables).
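That C-builtin vs pure-Python gap is easy to measure yourself with `timeit` from the standard library. This sketch (my own example, not from the thread) times the C-implemented `sum()` against an equivalent loop run entirely in the interpreter:

```python
import timeit


def py_sum(values):
    """Pure-Python re-implementation of sum() for comparison."""
    total = 0
    i = 0
    while i < len(values):
        total += values[i]
        i += 1
    return total


data = list(range(10000))

# Time 200 repetitions of each; sum() loops in C, py_sum() in the interpreter
t_builtin = timeit.timeit(lambda: sum(data), number=200)
t_python = timeit.timeit(lambda: py_sum(data), number=200)

print("builtin sum: {:.4f}s  pure-Python loop: {:.4f}s".format(t_builtin, t_python))
```

For profiling a whole script rather than one function, `cProfile` (also in the standard library) gives a per-call breakdown, which is handy for spotting which geoprocessing step dominates the runtime.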
Have fun!
Daryl
I'll second Daryl's comments and add that 'IN_MEMORY' can also vastly speed up table operations such as joins, especially when both datasets are 'IN_MEMORY'. The improvements I have documented depend on the tool, but 2-20x improvements are quite common. However, for table manipulation it's still not nearly as fast as storing data in a true SQL database and then doing your SQL operations there. File servers are notoriously slow but I haven't tried splitting my reads/writes between SSDs or using RAMdisk. Thanks for the ideas!
Brian