
Using Shapely and Fiona to buffer

Blog Post created by william.winner_noaa on Sep 17, 2019

I have been working on a way to take publicly available US Army Corps of Engineers (USACE) survey data and use it to determine the appropriate depths for charting channels.  The USACE does not always survey the entire channel, which leaves us with questions about how to update a single value that represents the entire channel.

 

To help with this, I ended up turning to the Shapely and Fiona Python libraries.

Background

The USACE makes their survey data available through their eHydro platform.  My question was: how can I use these datasets to automatically determine the shoalest (shallowest) value in a given channel?  The USACE regularly changes the extents of their channel reaches, which means that a survey run on one reach may later need to be used for a different reach.

 

The aim was to take a given polygon, query a master database of USACE survey data (built from the geodatabases available on eHydro) for all data that falls within that polygon, and then create a buffer around that data to determine the percent of coverage.
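
To make that last step concrete, here is a minimal sketch of the coverage calculation using Shapely, assuming the channel polygon and the merged buffer around the survey data are already Shapely geometries (the shapes and names below are made up for illustration):

from shapely.geometry import Polygon

# Made-up geometries standing in for the channel reach and the merged
# buffer around the survey data that falls within it
channel_polygon = Polygon([(0, 0), (100, 0), (100, 20), (0, 20)])
survey_buffer = Polygon([(10, 0), (60, 0), (60, 20), (10, 20)])

# Percent of the channel covered by buffered survey data
covered = survey_buffer.intersection(channel_polygon)
percent_coverage = 100.0 * covered.area / channel_polygon.area
print(round(percent_coverage, 1))  # 50.0 for these made-up shapes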

 

The Process

I originally built everything using built-in ArcGIS geoprocessing tools (a rough code sketch follows this list).  This included:

  1. SelectLayerByLocation: Used to select the points that fall within the polygon
  2. GenerateNearTable: Used to determine the distance from each point to its nearest neighbors.  I ended up searching for the nearest two points
  3. Statistics: Used to calculate the maximum distance between the nearest two points to create the original buffer
  4. Buffer: Create an initial buffer.  USACE survey data is generally individual lines perpendicular to the channel.  So, the goal was to outline individual survey lines
  5. MultipartToSinglepart: We needed to explode the buffer so that we could calculate the distance between survey lines
  6. FeatureToPoint: We converted the polygons to centroid points for quicker calculation of nearest points.
  7. GenerateNearTable: Used to determine the distance between individual survey lines
  8. Statistics: Used to calculate the mean and standard deviation of the distance between survey lines.  In this case, we wanted a buffer that would encapsulate most of the survey lines, but not all of them.
  9. Buffer: Create a buffer around the survey lines
  10. Clip: Used to clip the buffer to the original polygon
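
For reference, here is a rough sketch of that geoprocessing chain.  This is not the production code; the layer names, in_memory paths, and tool parameters (including the buffer distances) are illustrative assumptions, but the sequence of tools matches the numbered steps above.

import arcpy

arcpy.env.overwriteOutput = True

surveyPoints = "surveys_lyr"      # layer of survey soundings (assumed name)
channelPoly = "channel_polygon"   # channel reach polygon (assumed name)

# 1. Select the survey points that fall within the channel polygon
arcpy.SelectLayerByLocation_management(surveyPoints, "WITHIN", channelPoly)

# 2-3. Distance to each point's nearest two neighbors, then the maximum distance
arcpy.GenerateNearTable_analysis(surveyPoints, surveyPoints, "in_memory\\near1",
                                 "", "NO_LOCATION", "NO_ANGLE", "ALL", 2)
arcpy.Statistics_analysis("in_memory\\near1", "in_memory\\near1_stats",
                          [["NEAR_DIST", "MAX"]])
maxDist = next(arcpy.da.SearchCursor("in_memory\\near1_stats",
                                     ["MAX_NEAR_DIST"]))[0]

# 4. Initial buffer that outlines the individual survey lines
arcpy.Buffer_analysis(surveyPoints, "in_memory\\buf1", maxDist,
                      dissolve_option="ALL")

# 5-6. Explode the dissolved buffer and reduce each survey line to a centroid
arcpy.MultipartToSinglepart_management("in_memory\\buf1", "in_memory\\buf1_sp")
arcpy.FeatureToPoint_management("in_memory\\buf1_sp", "in_memory\\centroids")

# 7-8. Distance between survey lines, then the mean and standard deviation
arcpy.GenerateNearTable_analysis("in_memory\\centroids", "in_memory\\centroids",
                                 "in_memory\\near2", "", "NO_LOCATION",
                                 "NO_ANGLE", "ALL", 2)
arcpy.Statistics_analysis("in_memory\\near2", "in_memory\\near2_stats",
                          [["NEAR_DIST", "MEAN"], ["NEAR_DIST", "STD"]])
with arcpy.da.SearchCursor("in_memory\\near2_stats",
                           ["MEAN_NEAR_DIST", "STD_NEAR_DIST"]) as cursor:
    meanDist, stdDist = next(cursor)

# 9-10. Buffer the survey lines (the exact distance formula here is an
#       assumption) and clip the result back to the channel polygon
arcpy.Buffer_analysis(surveyPoints, "in_memory\\buf2", meanDist + stdDist,
                      dissolve_option="ALL")
arcpy.Clip_analysis("in_memory\\buf2", channelPoly, "in_memory\\coverage")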

 

The Problem

ArcGIS's buffer is very slow.  One polygon that I ran took 17 hours to go through all of the above steps.  So, I wanted to find a way to speed up the process.

 

The Solution

Shapely!  Shapely's buffer is much, much faster.  I was also hoping to use it to calculate nearest points, thinking that if it was faster at buffering, it would be faster for that as well... more on that later.

 

So, I had to figure out how to use Shapely and Fiona.  That took a lot of digging and a lot of trial and error; the Shapely documentation is not very thorough.  But eventually I figured it out, and this is what my code looks like:

import arcpy
import os
from shapely.geometry import mapping, shape
from shapely.ops import cascaded_union  # newer Shapely versions replace this with unary_union
from fiona import collection

class MyClass:

    #example: self.BufferFeatureClass("in_memory\\Points","new_Buffer",5.4)
    def BufferFeatureClass(self, inputFC, outputShp, bufferDist):
        arcpy.AddMessage("BufferFeatureClass(" + inputFC + "," + outputShp + "," + str(bufferDist) + ")")
        # create a temporary shapefile
        tempCopy = "in_memory\\shapely_buffer_temp"

        # name and location for the temporary shapefile
        #  -have to create a physical shapefile for fiona
        newShpDir = "C:\\Temp"
        newShpFileName = "shapely_points1_temp"

        # check if C:\Temp exists
        boolDelTemp = False
        if not os.path.exists(newShpDir):
            try:
                os.mkdir(newShpDir)
                boolDelTemp = True
            except:
                arcpy.AddMessage("Unable to create directory C:\\Temp")

        # create the temporary shapefile
        arcpy.CopyFeatures_management(inputFC,tempCopy)

        # Determine if we need to project the data
        descSR = arcpy.Describe(tempCopy).spatialReference
        if descSR.GCS.name == descSR.name: #it's a gcs, project the data
            arcpy.AddMessage("Projecting the file")

            # Step 4.1: Project the data

            # Step 4.1.1: Determine the UTM zone (get the first data point and use its longitude)
            tempCursor = arcpy.da.SearchCursor(tempCopy,["SHAPE@X"])
            row = tempCursor.next()
            utmZone = int(31+(row[0]/6))
            del tempCursor

            # Step 4.1.2: Project the data into the UTM projection
            arcpy.AddMessage("file = " + os.path.join(newShpDir, newShpFileName + ".shp") + "\nUTM Zone = " + str(utmZone))
            arcpy.Project_management(tempCopy, os.path.join(newShpDir, newShpFileName + ".shp"), arcpy.SpatialReference("NAD 1983 UTM Zone " + str(utmZone) + "N"))
        else: # we don't need to project the data, it's already in a projected system
            arcpy.CopyFeatures_management(tempCopy, os.path.join(newShpDir, newShpFileName + ".shp"))

        # save the spatial reference for later
        outputSR = arcpy.Describe(os.path.join(newShpDir, newShpFileName + ".shp")).spatialReference

        with collection(os.path.join(newShpDir, newShpFileName + ".shp"), "r") as input:
            schema = { 'geometry': 'Polygon', 'properties': {'Buffer':'str'}}
            with collection(os.path.join(newShpDir, outputShp + ".shp"), "w", "ESRI Shapefile", schema) as output:
                # create a blank list to store the features
                features=[]

                # create new features with the selected buffer added
                for feature in input:
                    features.append(shape(feature['geometry']).buffer(bufferDist))
               
                # Merge all of the features into one feature
                merged = cascaded_union(features)

                # write the output to a shapefile
                output.write({
                    'properties': {
                        'Buffer': str(bufferDist)
                    },
                    'geometry': mapping(merged)
                })

        # delete all but the output shapefiles
        arcpy.Delete_management(tempCopy)
        arcpy.Delete_management(os.path.join(newShpDir, newShpFileName + ".shp"))

        # now we have to define the projection
        arcpy.DefineProjection_management(os.path.join(newShpDir, outputShp + ".shp"),outputSR)

        # create an "in_memory" copy of the data
        returnName = "in_memory\\" + outputShp
        arcpy.CopyFeatures_management(os.path.join(newShpDir, outputShp + ".shp"), returnName)

        # delete the output shapefile
        arcpy.Delete_management(os.path.join(newShpDir, outputShp + ".shp"))

        # delete the directory if we created it
        if boolDelTemp:
            try:
                os.rmdir(newShpDir)
            except:
                arcpy.AddMessage("    -Could not delete temporary directory " + newShpDir)

        # return the location of the output file
        return returnName
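
The method returns the in_memory path of the buffered copy, so the result can feed straight into the clip step from the workflow above.  A hypothetical call (the input layer, output name, channel polygon, and buffer distance are placeholders):

tool = MyClass()
buffered = tool.BufferFeatureClass("in_memory\\Points", "survey_buffer", 5.4)

# clip the merged buffer back to the channel polygon (placeholder layer name)
arcpy.Clip_analysis(buffered, "channel_polygon", "in_memory\\survey_coverage")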

 

This sped up the process greatly.  For one specific polygon, the Shapely version took 3 minutes, 12 seconds to complete my full analysis (of which the above is just a part), while the ArcGIS Buffer solution added about 105 minutes.  That was across two buffer operations, so each ArcGIS buffer added roughly 50 minutes.  Calling the speedup substantial would be an understatement.

 

The Non-Solution

As I mentioned, I thought I would try out the Shapely nearest_points function to see if it sped up generating the near table.  One issue with the Shapely approach, though, is that you can't directly compare points within the same dataset: if you query with a point that is also in the dataset, nearest_points just returns that same point at a distance of 0.  So, I had to come up with a way to pop each point out of the dataset before running the nearest-point comparison.  My initial test was very promising, as individual comparisons ran very quickly.  However, with a dataset of 200k points, it was going to take hours to run.  I think I know why, but I'll post the code first.

import arcpy
import os
from shapely.geometry import Point, MultiPoint
from shapely.ops import nearest_points
from fiona import collection
import time

# get input layer
lyrPts = arcpy.GetParameterAsText(0)

# create a temporary layer
tempShp = "C:\\temp\\temp_shape.shp"
arcpy.CopyFeatures_management(lyrPts, tempShp)

startTime = time.time()
arcpy.AddMessage("Calculating nearest using shapely...")

# create a blank list
orderedPts = []

# read in the input shapefile, create shapely Points from the input and add to the list
with collection(tempShp, "r") as input:
    for point in input:
        orderedPts.append(Point(point['geometry']['coordinates']))

# set up the progressor
listLen = len(orderedPts)
arcpy.SetProgressor("step", "Calculating Nearest Points: 0 of " + str(listLen) + " calculated",0,listLen,1)

# create the variables for storing the results
totalDistance1 = 0.0
totalDistance2 = 0.0
maxDistance1 = 0.0
maxDistance2 = 0.0

# go through each item in the list
for i in range(listLen):
    # pop out the current point
    curPoint = orderedPts.pop(i)

    # calculate the nearest
    nearPt1 = nearest_points(curPoint, MultiPoint(orderedPts))[1]

    # pop out that point
    nearPt1Index = orderedPts.index(nearPt1)
    nearPt1 = orderedPts.pop(nearPt1Index)

    # calculate the second nearest to the current point
    # (its nearest neighbor has already been popped out of the list)
    nearPt2 = nearest_points(curPoint, MultiPoint(orderedPts))[1]

    # add current point, and distance to two nearest points to stats
    distPt1 = curPoint.distance(nearPt1)
    distPt2 = curPoint.distance(nearPt2)
    totalDistance1 += distPt1
    totalDistance2 += distPt2
    if distPt1 > maxDistance1:
        maxDistance1 = distPt1
    if distPt2 > maxDistance2:
        maxDistance2 = distPt2

    # add both popped out points back to list
    orderedPts.insert(nearPt1Index, nearPt1)
    orderedPts.insert(i, curPoint)
    arcpy.SetProgressorPosition()
    if i%10 == 0:
        arcpy.SetProgressorLabel("Calculating Nearest Points: " + str(i) + " of " + str(listLen) + " calculated")

arcpy.AddMessage("Mean distance 1 = " + str(round(totalDistance1/listLen,2)))
arcpy.AddMessage("Mean distance 2 = " + str(round(totalDistance2/listLen,2)))
arcpy.AddMessage("Max distance 1 = " + str(round(maxDistance1,2)))
arcpy.AddMessage("Max distance 2 = " + str(round(maxDistance2,2)))

arcpy.AddMessage("Total time = " + str(time.time() - startTime) + " seconds")

 

I never did verify that it was producing the correct statistics because it was taking so long.  But I did a lot of verification on individual points, and I'm confident it would have given me the same results as GenerateNearTable searching for the nearest two points followed by Statistics on that table.  The Shapely solution, though, would have taken hours instead of minutes.  My best guess as to why: every pass through the loop rebuilds a MultiPoint from all of the remaining points, and both that construction and the list index() lookup are linear scans, so the total work grows roughly with the square of the number of points.
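
If I revisit this, a spatial index is probably the way to go.  The sketch below uses SciPy's cKDTree, which is not part of the toolchain used in this post, to get each point's nearest and second-nearest neighbor distances in a single query; treat it as an untested alternative rather than a drop-in replacement.

import numpy as np
from scipy.spatial import cKDTree
from fiona import collection

# read the point coordinates from the same temporary shapefile as above
coords = []
with collection("C:\\temp\\temp_shape.shp", "r") as input:
    for point in input:
        coords.append(point['geometry']['coordinates'][:2])
coords = np.array(coords)

# k=3 returns each point itself (distance 0) plus its two nearest neighbors
tree = cKDTree(coords)
dist, idx = tree.query(coords, k=3)

nearestDist = dist[:, 1]        # distance to the nearest neighbor
secondNearestDist = dist[:, 2]  # distance to the second-nearest neighbor

print("Mean distance 1 =", round(nearestDist.mean(), 2))
print("Mean distance 2 =", round(secondNearestDist.mean(), 2))
print("Max distance 1 =", round(nearestDist.max(), 2))
print("Max distance 2 =", round(secondNearestDist.max(), 2))

Newer versions of Shapely also include an STRtree spatial index that could serve a similar purpose without leaving Shapely.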

 

Conclusion

I mostly wanted to post this in case others are trying to do something similar and finding the available resources limited.  My hope is that this helps someone else use Shapely and Fiona to buffer.

 

#python, #arcgis, #shapely, #shapely.nearest_points, #shapely.buffer, #fiona, #buffer, #GenerateNearTable, #arcpy
