Determine minimum distance between all points in two tables

JeffreyWilkerson · ‎07-07-2020

I have bus location information that is coming in daily (approx 50K of points a day) that I need to verify that the location cited is using the correct bus stop. In order to determine the nearest bus stop I was attempting to use SQL Server Spatial, but that's not as fast as I would like it to be.

I came across an older post about this (https://community.esri.com/message/939175-re-python-near-analysis-no-advanced-licence?commentID=9391...) and decided that Numpy became an option and I've built the code below to compare 2 different tables and return the id from the first one, and the corresponding id/minimum distance from the second one. I've worked the code below to get me the ID of the closest stop, but I'm looking for help in making sure this is the right way to go. This is what I'm requesting:

1. When I graduated from using input test featureclasses (now remarked out) to examples of the SQL Server tables I will be hitting, I noticed that the index for the minimum distance was always off by 1 (so I just added 1 to get the right one). I'm sure something is wrong there so any help in getting the right index of the minimum distance stop would help.

2. Also, I was told that I didn't need Pandas (by Dan Patterson) but I wasn't generating lat/longs yet as this is just a piece of the overall daily processing (most of it using SQL Server) and it doesn't need to be pushed to a featureclass.

3. I came across the current way of using numpy which seems to be just a little faster than the numpy.einsum function, but maybe not if it was all cleaned up more. All opinions are welcomed as I'm interested in knowing the best way to do this.

If you can provide any help I would sure appreciate it.

##Script: busStopDistances.py
##Author: Borrowed from Dan.Patterson@carleton.ca, and others
##Purpose: Numpy Repository stuff
##References:
## https://community.esri.com/thread/175646
## https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
## https://towardsdatascience.com/how-fast-numpy-really-is-e9111df44347
## https://en.wikipedia.org/wiki/Einstein_notation
## https://pro.arcgis.com/en/pro-app/arcpy/data-access/tabletonumpyarray.htm
##

import os, sys
import numpy as np
import arcpy
import numpy.lib.recfunctions, scipy.spatial, random, cProfile
import pandas as pd

np.set_printoptions(edgeitems=3, linewidth=80, precision=6, suppress=True, threshold=10)

def npNear(f1,f2):
    f1shp_fields = ["DilaxID","LAT", "LON"] # DilaxBusTest SQL Server table
    #f1shp_fields = ["OID@", "SHAPE@XY", "POINT_Y", "POINT_X"] # RandomPoints FC
    f2shp_fields = ["OID@", "SHAPE@XY"] # Bus stop locatons FC
    a = arcpy.da.TableToNumPyArray(f1, f1shp_fields)
    #a = arcpy.da.FeatureClassToNumPyArray(f1, f1shp_fields)
    b = arcpy.da.FeatureClassToNumPyArray(f2, f2shp_fields)
    
    origs = np.array(pd.DataFrame(a[:][['LON', 'LAT']])) # How do w/o Pandas?
    #origs = a['SHAPE@XY'][:] # for RandomPoints FC
    dests = b['SHAPE@XY'][:]

    # Distance calculations using Numpy Einsum
##    subts = origs[:,None,:] - dests
##    dist = np.sqrt(np.einsum('ijk,ijk->ij',subts,subts))

    # Distance calculations using other Numpy functions
    P = np.add.outer(np.sum(origs**2, axis=1), np.sum(dests**2, axis=1))
    N = np.dot(origs, dests.T)
    dist = np.sqrt(P - 2*N)
    
    # print(dist) # for testing

    # Generate output array
    out = np.zeros((len(dist),3)) 
    out[:,0] = a['DilaxID'][:] # the initial table's ID field (f1, a, origs)
    out[:,1] = np.argmin(dist, axis=1) + 1  # minimum value from the distance calc
        # Adding 1 to get the correct bus stop
    out[:,2] = np.amin(dist, axis=1) # minimum value's id, to be associated to later
    return out
                                                
if __name__=="__main__":

    # Featureclasses for testing
    #f1 = r"d:\Projects\GTFS_Update_Checker\RandomPoints.shp"
    #f2 = r"d:\Projects\GTFS_Update_Checker\Random2.shp"

    # SQL Server table/fc for testing
    f1 = os.path.join(sys.path[0], r"Ridership on W-SQL01.sde\Ridership.dbo.DilaxBusTest")
    f2 = os.path.join(sys.path[0],r"GTFS on W-SQL01.sde\GTFS.dbo.BusStopsWM")

    # Final table for results
    outPath = os.path.join(sys.path[0], r"Ridership on W-SQL01.sde\Ridership.dbo.DilaxBusDistTest")

    start = time.time()  # for determining processing time
    out = npNear(f1, f2) # Call to sub-routine for determining calculations

    # Send to output table in SQL Server
    if arcpy.Exists(outPath):
        arcpy.Delete_management(outPath)
    struct_out = numpy.core.records.fromarrays(
        out.transpose(),
        numpy.dtype([('dOID', numpy.int32), ('bOID', numpy.int32), ('distance', '<f8')]))
    arcpy.da.NumPyArrayToTable(struct_out, outPath)
    
    end = time.time()  # end processing time
    print(out)
    print(end - start)‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍‍