Introducing the new Nearest Neighbors tool in GeoAnalytics Engine

arif-masrur · 02-14-2023

Analyzing spatial relationships between geographic features is a common task in spatial data science. In the retail and real estate industries, organizations may ask “Which spatial features are closest to a feature of interest?” to determine optimal locations for new business developments. For example, which public transit stations are nearest to each bank branch? Which schools are near the neighborhood with the highest crime rate? Which points of interest (POIs) are nearest to each school? Answering such questions quickly becomes a spatial optimization problem that requires Nearest Neighbors analysis.

The Nearest Neighbors tool in GeoAnalytics Engine v1.1 makes it possible to identify the spatial features in one DataFrame that are closest to the query feature(s) in another DataFrame. To run the tool, you need one input DataFrame that serves as the ‘Query’ DataFrame: the dataset containing the locations for which you want to find the closest features. The tool then searches for the nearest features in a second input DataFrame, referred to as the ‘Data’ DataFrame in the graphic below.

Figure 1. Example input DataFrames and their nearest neighbors based on the ‘number of neighbors’ parameter.

You need to set the number of neighbors (k) to find for each query record. The output contains each query record joined with its nearby records, excluding the query record itself. Two output layouts are supported: long and wide. Each row of the long output contains a query record with a single nearest neighbor, whereas each row of the wide output contains a query record with all of its nearest neighbors.
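To make the k, search-distance, and layout ideas concrete, here is a minimal brute-force sketch in plain Python. It is not how GeoAnalytics Engine implements the tool; the function names (`k_nearest`, `to_long`, `to_wide`), the toy coordinates, and the choice to return fewer than k neighbors when fewer fall within the search distance are all assumptions made for illustration.

```python
import math

def k_nearest(query, data, k, max_dist=None):
    """Return {query_id: [(rank, data_id, distance), ...]} by brute force."""
    result = {}
    for qid, (qx, qy) in query.items():
        # Planar distance from this query point to every data point.
        dists = [(math.hypot(dx - qx, dy - qy), did)
                 for did, (dx, dy) in data.items()]
        if max_dist is not None:
            dists = [d for d in dists if d[0] <= max_dist]
        dists.sort()  # ascending distance, so rank 1 is the closest
        result[qid] = [(rank, did, dist)
                       for rank, (dist, did) in enumerate(dists[:k], start=1)]
    return result

def to_long(neighbors):
    """Long layout: one row per (query record, neighbor) pair."""
    return [(qid, did, rank, dist)
            for qid, rows in neighbors.items()
            for rank, did, dist in rows]

def to_wide(neighbors):
    """Wide layout: one row per query record holding all its neighbors."""
    return [(qid, [did for _, did, _ in rows])
            for qid, rows in neighbors.items()]

# Toy planar coordinates (hypothetical, in arbitrary units).
schools = {"s1": (0.0, 0.0), "s2": (10.0, 0.0)}
pois = {"a": (1.0, 0.0), "b": (0.0, 2.0), "c": (9.0, 0.0), "d": (50.0, 50.0)}

nn = k_nearest(schools, pois, k=2, max_dist=5.0)
long_rows = to_long(nn)
wide_rows = to_wide(nn)
```

In this toy run, the long layout has three rows (two for `s1`, one for `s2`), while the wide layout has exactly one row per school.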

The code snippets below demonstrate an example use case of Nearest Neighbors using 1) SafeGraph points of interest (POIs) data, 2) school data, and 3) county data. The question answered is: Which points of interest (POIs) are predominantly located near schools in the United States?

# Imports

from geoanalytics.tools import NearestNeighbors, Clip
from geoanalytics.sql import functions as ST
from pyspark.sql import functions as F
spark.conf.set("spark.databricks.delta.formatCheck.enabled", "false")

schools_data_path = "https://services1.arcgis.com/Ua5sjt3LWTPigjyD/arcgis/rest/services/Public_School_Location_201819/FeatureServer/0"
counties_data_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Counties_Generalized_Boundaries/FeatureServer/0"
safegraph_poi = spark.read.option("header", True).option("escape", "\"").csv("/safegraph-demo/core_poi-geometry/2022/*.csv.gz")

# Read and geo-process data

schools_df = spark.read.format("feature-service").load(schools_data_path) \
                  .withColumn("shape", ST.transform("shape", 26911)) \
                  .select("NCESSCH","NAME","STREET","CITY","shape")

poi_df = (safegraph_poi
                .withColumn("shape", ST.transform("Poly", 26911))
                .select("placekey","top_category","sub_category","brands","shape"))

# Run Nearest Neighbors tool to identify the 4 closest POIs near each school within 1 kilometer

print("This is the long-format layout for the output:")
result_long = NearestNeighbors() \
            .setNumNeighbors(4) \
            .setSearchDistance(1, "Kilometer") \
            .setResultLayout("long") \
            .run(schools_df, poi_df)
#result_long.show(12)

print("This is the wide-format layout for the output:")
result_wide = NearestNeighbors() \
            .setNumNeighbors(4) \
            .setSearchDistance(1, "Kilometer") \
            .setResultLayout("wide") \
            .run(schools_df, poi_df)
#result_wide.show(5, False)

The 'result_long' output DataFrame contains all columns from both input DataFrames as well as two new columns: near_rank (the rank of each nearest neighbor, assigned in ascending order of distance) and near_distance (the distance between the query record and that neighbor).

Next, we create a linestring geometry column that connects each school to each of its nearest-neighbor POIs.

result_long_line = result_long.withColumn("shortest_line", ST.shortest_line("shape", "shape1"))
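For intuition about what a shortest line between a point and a polygon looks like, the sketch below computes it in plain Python for a point that lies outside a single closed ring. It is an illustration only, not the implementation behind `ST.shortest_line`; the helper names and coordinates are hypothetical, and the sketch measures to the ring boundary, so it does not handle a point that falls inside the polygon.

```python
import math

def closest_point_on_segment(p, a, b):
    """Closest point to p on the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return a  # degenerate segment: both endpoints coincide
    # Project p onto the segment's line, then clamp to the segment itself.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def shortest_line(point, ring):
    """Return ((start, end), length) from a point to a closed ring's boundary."""
    best = None
    for a, b in zip(ring, ring[1:]):
        q = closest_point_on_segment(point, a, b)
        d = math.dist(point, q)
        if best is None or d < best[0]:
            best = (d, q)
    return (point, best[1]), best[0]

# Hypothetical school point and square POI footprint (first vertex repeated).
school = (0.0, 0.0)
poi_ring = [(3.0, -1.0), (5.0, -1.0), (5.0, 1.0), (3.0, 1.0), (3.0, -1.0)]

line, length = shortest_line(school, poi_ring)
```

Here the line runs from the school to the midpoint of the square's nearest edge, and its length matches the point-to-polygon distance.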

Using the code snippet below, we visualize these three geometries: schools, near rank lines, and POIs (see Figure 2).

# Visualization

boundingbox = result_long_line.selectExpr("NCESSCH", "NAME", "STREET", "near_rank", "near_distance", "placekey", "top_category", "sub_category", "brands", "shape", "shape1", "shortest_line", "ST_EnvIntersects(shape1, 348534.21, 3799841.19, 361078.99, 3812818.39) as Filter")
result_long_line_sub = boundingbox.filter(boundingbox['Filter'] == True)

school_area = result_long_line_sub.st.plot(geometry="shape",
                                    color="black", marker_size=5,
                                    figsize=(16, 10))

school_area.set(xlim=(352000.00, 364078.99), ylim=(3804000.19, 3812818.39))
plot_lines = result_long_line_sub.st.plot(geometry="shortest_line",
                                    facecolor="none",
                                    edgecolor="lightblue",
                                    basemap="light",
                                    figsize=(16, 10), ax=school_area)

result_plot = result_long_line_sub.st.plot(geometry="shape1",
                                  is_categorical=True,
                                  cmap_values="top_category",
                                  cmap="Greens",
                                  basemap="light", ax=school_area)

result_plot.set_title("Searching for four nearest POIs around schools within 1 km search distance")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)")

Figure 2. A zoomed-in view of the Nearest Neighbors tool output displayed on a map. Points represent schools, lines represent the distance between each school and its POIs within the 1-km search radius, and polygons represent POI types.

The Nearest Neighbors tool outputs were published as feature services (using the code snippet below) and then styled in ArcGIS Online. You can explore the output for the entire United States in this interactive web app.
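The bounding-box filter in the visualization snippet keeps only rows whose POI geometry envelope overlaps a fixed extent. The plain-Python sketch below shows the kind of envelope overlap test this implies. The function name `env_intersects` and the two sample envelopes are hypothetical; only the extent values are taken from the snippet above.

```python
def env_intersects(env, xmin, ymin, xmax, ymax):
    """True if envelope (gxmin, gymin, gxmax, gymax) overlaps the given extent."""
    gxmin, gymin, gxmax, gymax = env
    # Two axis-aligned rectangles overlap only if they overlap on both axes.
    return gxmin <= xmax and gxmax >= xmin and gymin <= ymax and gymax >= ymin

# Extent used in the visualization snippet.
extent = (348534.21, 3799841.19, 361078.99, 3812818.39)

# Hypothetical POI envelopes: one inside the extent, one far to the east.
inside_env = (355000.0, 3805000.0, 355050.0, 3805040.0)
outside_env = (400000.0, 3805000.0, 400050.0, 3805040.0)

keep_inside = env_intersects(inside_env, *extent)
keep_outside = env_intersects(outside_env, *extent)
```

An envelope test is cheap because it compares only four numbers per geometry, which is why it suits a quick pre-filter before plotting.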

# Publish output

from arcgis.gis import GIS
gis = GIS(username="myusername", password="mypassword")

result_lyr = result_long.select("NCESSCH", "NAME", "STREET", "near_rank", "near_distance", "placekey", "top_category", "sub_category", "brands", "shape1")
sdf = result_lyr.st.to_pandas_sdf()
lyr = sdf.spatial.to_featurelayer('NN_result_POIs_US')

So, which points of interest (POIs) are predominantly located near schools in the United States? To answer that, we summarize the output by POI category and near rank, using the code snippet below. Figure 3 shows that the POI category most frequently found closest to schools within 1 km is child day care services, followed by religious organizations.

# Group NN output by POI category and near rank for the entire US

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

result_long_group = result_long.groupBy("top_category", "near_rank").count()
result_long_group1 = result_long_group.where("top_category != 'Elementary and Secondary Schools'")

# Keep the most frequent POI category for each near rank
windowDept = Window.partitionBy("near_rank").orderBy(col("count").desc())
result_long_group_sub = result_long_group1.withColumn("row", row_number().over(windowDept)) \
  .filter(col("row") == 1).drop("row")

Figure 3. SafeGraph POI counts by near rank around schools in the United States.
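The grouping and ranking step above can be mirrored in plain Python to show the logic: count rows per (category, near rank) pair, then keep the most frequent category within each rank. The rows below are made-up toy values, not results from the analysis, and "Other (hypothetical)" is a placeholder category. Ties are resolved here by keeping the first category seen, which is an arbitrary choice for this sketch.

```python
from collections import Counter

# Toy (top_category, near_rank) rows; values are invented for illustration.
rows = [
    ("Child Day Care Services", 1),
    ("Child Day Care Services", 1),
    ("Religious Organizations", 1),
    ("Religious Organizations", 2),
    ("Religious Organizations", 2),
    ("Other (hypothetical)", 2),
]

# Equivalent of groupBy("top_category", "near_rank").count()
counts = Counter(rows)

# Equivalent of taking row_number() == 1 within each near_rank partition,
# ordered by count descending.
top_per_rank = {}
for (category, rank), n in counts.items():
    best = top_per_rank.get(rank)
    if best is None or n > best[1]:
        top_per_rank[rank] = (category, n)
```

Each entry of `top_per_rank` maps a near rank to its most frequent category and that category's count.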

We hope this helps give you some of the ins and outs of using Nearest Neighbors in your workflows! Don’t hesitate to let us know in the comments if you have questions or are interested in hearing more about other GeoAnalytics Engine tools and functions.