5 million Point buffer analysis

12-25-2015 06:53 AM
GaneshmoorthiM
New Contributor III

Hi,

I have 5 million points. I would like to generate buffers, find the points that fall within each buffer region, and perform some calculations.

I tried a Spatial Join but got an out-of-memory exception, so now I am trying Multiprocessing.

I would like to know whether there is a more efficient way to do this in Python, or any other method.

19 Replies
ChrisSnyder
Regular Contributor III

The tool 'Generate Near Table' (BTW: only available at the ArcInfo license level) would also work for something like this... Basically it builds a distance-based one-to-many association table. A warning: the resulting output table is often many orders of magnitude larger than the number of input features - the explanation being that one house can be within 100 miles of many different restaurants. Be sure to apply a search radius! As an experiment, you might first apply a very small search radius (< 1 mile) to see the format of the output data you will be dealing with... before you accidentally end up with an output table that is 300 million records.
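
For reference, a minimal sketch of what that Generate Near Table call might look like (the feature class paths and the one-mile radius here are placeholders, not from the post):

import arcpy

# Hypothetical inputs - substitute your own feature classes and radius
housePnts = r"C:\temp\test.gdb\houses"
foodPnts = r"C:\temp\test.gdb\restaurants"
nearTable = r"C:\temp\test.gdb\house_restaurant_near"

# Builds the one-to-many association table: every restaurant within 1 mile of each house.
# Keep the search radius small at first so the output stays a manageable size.
arcpy.GenerateNearTable_analysis(housePnts, foodPnts, nearTable,
                                 search_radius="1 Miles",
                                 closest="ALL")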

If you are running out of RAM, consider installing the 64-bit background geoprocessor: Background Geoprocessing (64-bit)—Help | ArcGIS for Desktop

Not sure exactly what you are doing here, but if I were doing this in a script, I would initially just loop through the restaurants, select the houses, gather their stats, and write out some result. Something like:

import arcpy

foodPnts = r"C:\temp\test.gdb\restaurants"
housePnts = r"C:\temp\test.gdb\houses"
# Make a feature layer of the houses so we can run selections against it
arcpy.MakeFeatureLayer_management(housePnts, "fl")
searchRows = arcpy.da.SearchCursor(foodPnts, ["SHAPE@", "RESTAURANT_ID"])
for searchRow in searchRows:
  shapeObj, restaurantId = searchRow
  # Select every house within 100 miles of this restaurant
  arcpy.SelectLayerByLocation_management("fl", "INTERSECT", shapeObj, "100 MILES")
  recordCount = int(arcpy.GetCount_management("fl").getOutput(0))
  print("There are " + str(recordCount) + " houses within 100 miles of restaurant #" + str(restaurantId))
  #using the selected features/records in "fl", you now have a hook so as to get their names, incomes, etc.
  #blah, blah...
del searchRow, searchRows
GaneshmoorthiM
New Contributor III

Hi Dan, Chris,

The scenario I explained in my previous post was just an example.

I am using the following code to find the customers within a 200-mile radius of our distribution centers.

It took almost 2 hours to analyse 250 distribution center points against 0.4 million customer locations. Is there any way we can improve the performance?

import arcpy

DC_Points = r"C:\temp\TestGDB1.gdb\DistributionSites"
CL_Points = r"C:\temp\TestGDB1.gdb\CustomerLoaction"
fcSumPoint = r"C:\temp\TestGDB1.gdb\fcSumPoint"

arcpy.MakeFeatureLayer_management(CL_Points, "cl_points")
searchRows = arcpy.da.SearchCursor(DC_Points, ["SHAPE@", "site_id", "cov1val"])
print("searchcursor")
fields = ("SHAPE@", "dc_Id", "cl_count", "acSum")
inCursor = arcpy.da.InsertCursor(fcSumPoint, fields)
print("insert cursor")
for searchRow in searchRows:
    shapeObj, siteId, acSum = searchRow
    # select the customers within range of this distribution center
    buffSelectionLyr = arcpy.SelectLayerByLocation_management("cl_points", "INTERSECT", shapeObj, "20 MILES", "NEW_SELECTION")
    recordCount = int(arcpy.GetCount_management("cl_points").getOutput(0))
    sumValues = 0
    sumValues += acSum
    print(siteId, recordCount, sumValues)
    infiled = [shapeObj, siteId, recordCount, sumValues]
    inCursor.insertRow(infiled)
    print("There are " + str(recordCount) + " points in #" + str(siteId))
    #using the selected features/records in "fl", you now have a hook so as to get their names, incomes, etc.
del searchRow, searchRows, inCursor
DanPatterson_Retired
MVP Emeritus

Not bad... unless you have to do it daily. If you have lots of memory (i.e. 16-32 GB) and ArcGIS Pro, why don't you benchmark it there, since it should run without alteration in that environment. It would be an interesting comparison. You are already using the Python 3.4.x version of the print statement, which is good, and I can't see anything else that would cause a fault in the newer Python either.

PS

print("some stuff {} plus more stuff {}".format(first_stuff,second_stuff))

takes care of type conversion; the {} brackets are filled sequentially. See Python's format specification mini-language for more details, or one of my blogs on formatting.

GaneshmoorthiM
New Contributor III

I have to run the analysis daily and it is taking a long time (for 95,000 points it took almost 9 hours). I will try ArcGIS Pro.

One more question: is there any other way we can improve the performance?

DanPatterson_Retired
MVP Emeritus

As indicated previously, you will have to narrow your search from all the locations to just those near the point buffer you are looking at. You won't be able to do an all-against-all comparison at once within an ArcMap environment.
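
One way to read that advice in code - a rough, hypothetical sketch only - is to pre-filter the candidates with a cheap bounding-box test before doing the exact selection for each buffer; the paths, the cust_value field, and the 200-mile radius in metres below are assumptions:

import arcpy

DC_Points = r"C:\temp\TestGDB1.gdb\DistributionSites"
CL_Points = r"C:\temp\TestGDB1.gdb\CustomerLoaction"
radius = 321869.0  # roughly 200 miles in metres; assumes a projected coordinate system

# Pull the candidate coordinates (and a hypothetical value field) into memory once
candidates = [row for row in arcpy.da.SearchCursor(CL_Points, ["SHAPE@X", "SHAPE@Y", "cust_value"])]

for cx, cy, siteId in arcpy.da.SearchCursor(DC_Points, ["SHAPE@X", "SHAPE@Y", "site_id"]):
    # Cheap bounding-box test: only these survivors need the exact distance/selection step
    nearby = [(x, y, val) for x, y, val in candidates
              if abs(x - cx) <= radius and abs(y - cy) <= radius]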

GaneshmoorthiM
New Contributor III

I am executing the script as a batch process. I have tried to minimise the candidates, but it is still a huge amount of data.

DanPatterson_Retired
MVP Emeritus

Then you will need to prune your data set based on some criteria.  We still don't know exactly what you are doing and why you need every site within your buffer.

GaneshmoorthiM
New Contributor III

Each candidate in the buffer zone has its own statistical value, so I have to find all the candidates in a given buffer, get each candidate's value, and perform an additional statistical calculation such as a sum. It is a points-within-polygon analysis. The only problems I am facing are the performance and the out-of-memory issue.

DanPatterson_Retired
MVP Emeritus

Then you need to delimit the candidates whose values change. 

  • Step 1 would be to take a snapshot and determine the "value" for each candidate at a point in time. That is your benchmark. Do the long, hard work of deriving the benchmarks for the buffers once. This is easy to do: you could obtain millions of values in microseconds using arcpy and/or numpy (see my previous posts).
  • Step 2: get the statistical parameter for all candidates at time step 2. Ditto on doing it quickly.
  • Step 3: determine the candidates that changed... a simple difference... incredibly fast.
  • Perform your buffer query on only the candidates from step 3... it should be fast for all buffers since your candidates of interest are now greatly reduced.
  • Final step: take the changes per buffer and apply them to the benchmark.

So if your buffer size and locations remain the same, the problem can be greatly reduced to a smaller set by only querying for the 'changelings' after your benchmark conditions are determined.
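
A minimal sketch of the "changelings" difference in step 3, using plain numpy (the IDs and values below are made up for illustration, not from the thread):

import numpy as np

# Hypothetical candidate IDs with their benchmark and current values
ids = np.array([101, 102, 103, 104])
benchmark_vals = np.array([5.0, 7.5, 3.2, 9.1])   # step 1: the benchmark snapshot
current_vals = np.array([5.0, 8.0, 3.2, 9.1])     # step 2: values at time step 2

# Step 3: a simple difference - keep only the candidates whose value changed
changed_ids = ids[benchmark_vals != current_vals]
print("candidates to re-query against the buffers:", changed_ids)  # -> [102]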

ChrisSnyder
Regular Contributor III

Yes, there is a way to improve performance... Like Dan mentioned, don't use ArcGIS (directly at least).

Here is another approach leveraging Python dictionaries. Note that this code only works for planar coordinates (UTM, State Plane, etc.) so if your data is in Geographic coordinates you need to use another distance function! BTW: This code assumes all fields are fully populated with valid values - if not you'll have to handle those exceptions.

Numpy has some nice ways to do all this stuff too - I usually prefer to roll my own though.

import math
import arcpy

# assumed feature class paths (not defined in the original snippet)
sitesFC = r"C:\temp\test.gdb\restaurants"
custFC = r"C:\temp\test.gdb\customers"

def getDist(x1, y1, x2, y2):
    # planar (Euclidean) distance - only valid for projected coordinates
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

# key each dictionary by its (x, y) coordinate pair; the value holds the remaining attributes
sitesDict = {(r[0], r[1]): r[2:] for r in arcpy.da.SearchCursor(sitesFC, ["SHAPE@X", "SHAPE@Y", "RESTAURANT_ID"])}
customersDict = {(r[0], r[1]): r[2:] for r in arcpy.da.SearchCursor(custFC, ["SHAPE@X", "SHAPE@Y", "INCOME"])}

for x1, y1 in sitesDict:
    incomeList = []
    for x2, y2 in customersDict:
        if getDist(x1, y1, x2, y2) <= 1000: #or whatever map unit threshold...
            income = customersDict[(x2, y2)][0]
            if income > 0:
                incomeList.append(income)
                #blah
    incomeListSum = sum(incomeList)
    incomeListLen = len(incomeList)
    if incomeListLen > 0: #guard against restaurants with no customers in range
        print("Restaurant #" + str(sitesDict[(x1, y1)][0]) + " has " + str(incomeListLen) +
              " customers - average income is " + str(incomeListSum / float(incomeListLen)))
