
Getting Started with GeoAnalytics for Fabric: Calculating new geometries & their properties

03-13-2025 08:00 AM
SBattersby
Esri Contributor

To help kick off the Public Preview of ArcGIS GeoAnalytics for Microsoft Fabric, we will have a series of blog posts exploring the functionality in GeoAnalytics for Fabric.  In this post, we will explore the fundamentals of constructing new geometries such as buffers, spatial bins, and shortest paths.  We'll also look at calculating properties of these new geometries, like length and area.

Importing GeoAnalytics for Fabric

We'll start with a refresher on how to import GeoAnalytics for Fabric and start working with it. To import GeoAnalytics for Fabric you just need to add import geoanalytics_fabric to a cell in your notebook.  For more information on enabling the library, see the Getting started documentation.

For convenience, we recommend also importing the geospatial functions directly and giving them an easy-to-use alias. In the example below, we import the library and the functions with an alias of ST. Since each function is then referenced as ST.<function_name>, this makes them easy to call and identifies them as Spatial Type (ST) functions. This also aligns with the structure of examples throughout the ArcGIS GeoAnalytics documentation.  Your cell will look like this:

 

# import ArcGIS GeoAnalytics
import geoanalytics_fabric
import geoanalytics_fabric.sql.functions as ST

 

For instance, after importing, you would be able to reference the ST_Buffer function using ST.buffer instead of geoanalytics_fabric.sql.functions.buffer.

Calculating new geometries

GeoAnalytics for Fabric contains a number of spatial operations for generating new geometries based on the input. In this post, we will explore a few common functions:

  • ST_Buffer & ST_GeodesicBuffer
  • Spatial bin functions (e.g., ST_H3Bin and ST_BinGeometry)
  • ST_ShortestLine & ST_GeodesicShortestLine
  • Distance, length, and area functions (e.g., ST_Distance, ST_Length, and ST_Area, plus their geodesic equivalents)

While we will explore a few functions in detail, there are many more in GeoAnalytics for Fabric library that may be of interest for your analytics projects. You can explore the range of functions in the documentation of SQL Functions.

Let's get started with some buffers!

ST_Buffer & ST_GeodesicBuffer

ST_Buffer and ST_GeodesicBuffer create polygons representing the area less than or equal to the distance specified around the input geometry (e.g., the region within 10 miles around a point). These functions will create this buffer around every entity in the geometry column.

To see how this works, we'll start by creating a new DataFrame with a few school locations in the Boston area. We will buffer those to visually look at the area around each school.  Here is an example of creating a new DataFrame with a few point locations:

 

# create a DataFrame with locations of several schools (in latitude and longitude)
df_schools = spark.createDataFrame([
    (-71.0777, 42.3388, "Hurley K-8"),
    (-71.0919,42.3176, "Higginson Elementary School"),
    (-71.1016, 42.3164, "Mendell Elementary School")], ["longitude", "latitude", "name"])

# create a point geometry column ("geometry") from the latitude and longitude values
df_schools = df_schools\
    .withColumn("geometry", ST.point("longitude", "latitude", sr=4326))

 

Here are the results:

 

[Image: table of the schools DataFrame with the new point geometry column]

Now we can create buffers around these schools.

The distance unit for the buffer is in the same units as the input geometry. For example, if the data is in the web Mercator coordinate system, the units will be meters; if your data is in a geographic coordinate system, the units will be decimal degrees.

The ST_GeodesicBuffer function can be used to create buffers using geodesic distance in meters.  Let's create a few buffers and take a look at them - and see how the planar buffer and geodesic buffer differ.

 

# create example buffers using ST_Buffer and ST_GeodesicBuffer
# ST_Buffer generates a buffer of 0.5 _decimal degrees_ (the units of the input data, latitude/longitude)
# ST_GeodesicBuffer creates a geodesic buffer of 500 meters
df_schools = df_schools\
    .withColumn("geometry_buffer", ST.buffer("geometry", 0.5))\
    .withColumn("geometry_geodesic_buffer", ST.geodesic_buffer("geometry", 500))

 

First, let's look at the planar buffer, which was calculated in the units of the input geometry (decimal degrees).  This is a 0.5 decimal degree buffer around the three schools:

[Image: map of the 0.5 decimal degree planar buffers around the three schools]

 

Note that I left the axis labels on the map so that you can see that this is plotting on a basemap in decimal degrees.

And now, let's look at the geodesic buffers around the schools.  Note that this is also plotting on a basemap in decimal degrees.  We can see that the buffers are much smaller (500m around each school), and that they are not circular.  Let's explore why this is...

[Image: map of the 500 meter geodesic buffers, which appear as small ovals]

There are a few key things to notice when plotting these buffers:

  • The buffers in the first plot look huge, while the buffers in the second are much smaller. The "geometry_buffer" is half a decimal degree, while the "geometry_geodesic_buffer" is just 500 meters. One decimal degree is significantly longer than 500 meters. At the equator, a degree of latitude or longitude is about 69 miles (111 kilometers)!

 

  • The buffers in the first plot ("geometry_buffer") look circular while the buffers in the second plot ("geometry_geodesic_buffer") look like ovals. Why is this? Both of these maps are generated on a map based on the latitude and longitude coordinates. In these plots, one degree of latitude is equal to one degree of longitude in length. On the earth, the length of one degree of latitude and longitude differ, with a degree of longitude becoming a shorter and shorter length as you get closer to the poles. The Wikipedia entry on Longitude discusses this in nice detail if you would like more information. When the geodesic buffers are plotted on this particular map projection, they appear stretched out into ovals, though, on the earth's surface they would be circular.
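
To put numbers on this, here is a quick back-of-the-envelope calculation (plain Python, independent of the GeoAnalytics library) showing how much shorter a degree of longitude is than a degree of latitude near the Boston schools, and why a 0.5 degree buffer plots as an oval:

```python
import math

# approximate length of one degree of latitude on the earth (roughly constant)
METERS_PER_DEG_LAT = 111_320

lat = 42.33  # approximate latitude of the Boston schools

# a degree of longitude shrinks with the cosine of latitude
meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(lat))

print(f"1 deg latitude  ~ {METERS_PER_DEG_LAT / 1000:.1f} km")
print(f"1 deg longitude ~ {meters_per_deg_lon / 1000:.1f} km at {lat} N")
# a 0.5 degree buffer therefore spans ~55.7 km north-south but only
# ~41 km east-west, so it draws as an oval on a plain lat/lon plot
```
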

If we were to plot the geodesic buffer in our local Boston coordinate system (SRID 2249), the buffers look like the circles that we would expect.

We can update the map when we plot it by adding an sr (spatial reference) parameter. The example code below demonstrates this. More information on the parameters for st.plot can be found in the API reference.

 

# plotting the geometry_geodesic_buffer on a map with spatial reference of 2249
df_schools\
    .st.plot(basemap="streets", geometry="geometry_geodesic_buffer", alpha=0.5, edgecolor="black", sr=2249);

 

In general, it is faster to run ST_Buffer vs. ST_GeodesicBuffer as the buffers are calculated using planar coordinates instead of geodesic coordinates. However, this means that you should select an appropriate local coordinate system prior to using ST_Buffer.

For example, we can translate the latitude and longitude-based points to a local coordinate system like the Massachusetts State Plane Coordinate System (SRID: 2249) and then create our buffers. We can do this using the ST_Transform function as part of the buffer creation.  We'll show an example of that here:

 

# create 500 foot buffers around the school coordinates using a local coordinate system
df_schools = df_schools\
    .withColumn("geometry_buffer_2249", ST.buffer(ST.transform("geometry", 2249), 500))

 

When transforming data between coordinate systems, it is important to know the units that are used. In this case, the units are feet. You can look this up with a web site like SpatialReference.org. Here is the entry for SRID 2249.
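
As a sanity check (plain Python, independent of the library), we can compare the 500 foot buffer distance to the earlier 500 meter geodesic buffer. Note that SRID 2249 uses US survey feet, which differ from international feet by only about 2 parts per million:

```python
# convert the 500 foot buffer distance to meters for comparison
US_SURVEY_FOOT_M = 1200 / 3937  # US survey foot (used by SRID 2249)

print(f"500 ft ~ {500 * US_SURVEY_FOOT_M:.2f} m")   # ~152.40 m
print(f"500 m  ~ {500 / US_SURVEY_FOOT_M:.1f} ft")  # ~1640.4 ft
# so the 500 ft buffers have roughly a third the radius of the
# 500 m geodesic buffers created earlier
```
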

We can then plot those buffers and see the difference in the map.  Note that the units on the axis are different - because our data is now in the Massachusetts State Plane Coordinate System (SRID: 2249).

 

# plot the buffer calculated in feet
df_schools.st.plot(basemap="streets", geometry="geometry_buffer_2249", alpha=0.5, edgecolor="black");

 

[Image: map of the 500 foot buffers plotted in SRID 2249]

Now that we have nice buffers, let's take a little detour before we move on to calculating other interesting geometries.  Let's quickly look at how we can use these buffers for analysis.

We'll use the buffers to identify 311 public service requests from the city of Boston that are near each school.  To demonstrate this, we'll use the Boston public safety dataset from the Azure Open Datasets that we've looked at in other posts.

As a reminder, here is how we can import that data into a DataFrame:

 

# https://learn.microsoft.com/en-us/azure/open-datasets/dataset-boston-safety?tabs=pyspark
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Boston"
blob_sas_token = r""

# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
  'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
  blob_sas_token)
print('Remote blob path: ' + wasbs_path)

# SPARK read parquet
df = spark.read.parquet(wasbs_path)

 

Let's find all of the service requests that are near each school using the buffers that we created.  We'll do that and plot the results in the same bit of code for simplicity, though for use in further analysis you might want to create a new DataFrame or add a column with the results instead.

In this example we use a spatial join based on ST_Contains.  This finds all of the service call requests that are contained by the buffers around the schools.  This is just a simple example; we will discuss more about spatial joins in a future blog post...

 

# plot the results and color code the service requests based on which school they are near
df.join(df_schools.drop("geometry"), ST.contains("geometry_geodesic_buffer", "geometry"))\
    .st.plot(basemap="streets", geometry="geometry", marker_size=5, cmap_values="name");

 

[Image: map of service requests color coded by the school they are near]

 

In this example, the service requests around each school are color encoded based on the school they are near.  So we have just created a geometry (Buffer) and used it to identify locations of interest (near to schools).  There is more than one way to do this type of analysis, but this shows a simple example using buffer polygons.

Let's move on and look at some other interesting geometries that we can generate with GeoAnalytics for Fabric.

 

Aggregation with Bin functions

With large point datasets where many features overlap, grouping the features into a polygon bin (e.g., square, hexagon, or administrative polygon like a Census tract) with aggregated attributes (e.g., count of features in the bin, average value for an attribute, etc.) can help in simplifying and visualizing patterns.

GeoAnalytics includes three methods for spatial binning:

  • ST_SquareBin - square bins
  • ST_HexBin - hexagonal bins
  • ST_H3Bin - hexagonal bins from the H3 discrete global grid system

In this post, we will just look at one example, but there are more details on the range of spatial binning capabilities with GeoAnalytics for Fabric in the developer documentation.

Since we imported the Boston service request dataset when we were looking at using buffers for a spatial join, let's keep using that dataset as an example.  It's large enough that spatial binning might uncover patterns that we can't see by just looking at the point data by itself.

Identifying a bin for each record

We will start by using the ST_H3Bin function to identify the H3 bin that each of our service request points falls inside. We will then use the PySpark groupBy function to make a simplified DataFrame that lists the count of points inside each bin.

The size of H3 bins is based on a resolution value. For this example, we will use a resolution of 8, which equates to an average hexagonal bin size of 0.73 square kilometers. More information on H3 bin resolutions can be found here.
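
A useful rule of thumb (plain Python, with an approximate value taken from the H3 resolution table): each step up in resolution subdivides a hexagon into roughly seven children, so average cell area shrinks by about a factor of seven per resolution level:

```python
# approximate average H3 hexagon area at resolution 0, in square kilometers
# (value from the H3 resolution table; treat as approximate)
AVG_AREA_RES0_KM2 = 4_357_449

for res in (7, 8, 9):
    approx_area = AVG_AREA_RES0_KM2 / 7 ** res
    print(f"resolution {res}: ~{approx_area:.2f} sq km per hexagon")
# resolution 8 comes out to roughly 0.76 sq km, close to the documented
# average of ~0.737 sq km (pentagon cells make the true averages deviate
# slightly from the pure 1/7 scaling)
```
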

Adding the ID for a bin is as easy as creating a new column for the data and using the bin function for the geometry type you want (hexagonal, square, or H3):

 

# add a column with the H3 bin _id_ that each point falls inside
df = df\
    .withColumn("h3_bin", ST.h3_bin("geometry", 8))

 

The ST_H3Bin function adds a bin id to each record. It doesn't add the geometry specifically, as there would be many duplicates of the geometry (one for each point inside the bin) and it is more efficient to work with the ID until we need the geometries for analysis or visualization.  We can see these new bin IDs in the table below:

[Images: service request table showing the new h3_bin ID column]

Using the h3_bin ID field, we can then group the values in our table to count the records in each bin.

 

# group the service requests DataFrame using the h3 bin ID and count all records
df_by_bin = df.groupBy("h3_bin").count()

 

After grouping, each bin ID has a count of the number of points inside it.  We can then convert the bin IDs into geometries and visualize the results.

[Image: grouped table of h3_bin IDs with record counts]

Here is how we add the bin geometry into the table using the ST_BinGeometry function:

 

# add the bin geometry based on the bin ID using ST_BinGeometry
df_by_bin = df_by_bin\
    .withColumn("geometry_bin", ST.bin_geometry("h3_bin"))

 

Plotting the results now shows us where the greatest concentration of service requests occurs:

[Image: map of H3 bins showing the concentration of service requests]

 

Now let's move on to looking at calculating some lines and distances between locations!

ST_ShortestLine and ST_GeodesicLine

ST_ShortestLine returns a linestring column representing the shortest line that touches two geometries, using planar distance calculations. If more than one shortest line exists between the two geometries, only one is returned.

To create a shortest line using geodesic distance calculations, use ST_GeodesicShortestLine.

As an example of this, we will look at the shortest line between the schools DataFrame that we created earlier and the service requests that are nearby. Shortest line will use two geometry columns as input, so they need to be in the same DataFrame.

In the buffer examples above, we used ST_Contains to find all of the points inside the buffer polygon.

In this example, we will use another spatial function, ST_DWithin (distance within), to find all points within a specified distance of the school points. This can give effectively the same result, but ST_DWithin is often more performant than creating a buffer and performing a spatial join. ST_DWithin uses planar calculations by default, so we should either transform the geometry into a local coordinate system or set the parameter to use geodesic calculations.

In this case, we will use ST_Transform to update the coordinate system for this calculation. We will do this in a single calculation where we join the schools and service requests and create a new DataFrame with the results:

 

# find all points within 500 feet of each school 
# use a spatial join to match any point that is within the specified distance of the school
# since the datasets are originally in latitude/longitude coordinates, we transform to a local coordinate system first 
from pyspark.sql import functions as F

df_schools_requests = df.join(df_schools.select("name", F.col("geometry").alias("geometry_school")), 
    ST.dwithin(ST.transform("geometry", 2249), ST.transform("geometry_school", 2249), 500))

 

Now that each school has the geometry for the school and the points nearby, we can use those two geometries to calculate the shortest line between each service request point and the school.

 

# calculate the shortest line from each point to each school
df_schools_requests = df_schools_requests\
    .withColumn("geometry_line", ST.shortest_line("geometry", "geometry_school"))

 

Let's take a look at the result for one school:

 

# plot the connector lines between one school and all of the points within 500 feet of this location
df_schools_requests.filter(F.col("name") == "Hurley K-8").st.plot(basemap="streets", geometry="geometry_line", linewidth=0.5, alpha=0.7);

 

[Image: map of shortest lines connecting nearby service requests to Hurley K-8]

For short distances, ST_ShortestLine is a good choice, but for longer distances, you might consider using ST_GeodesicShortestLine so that the line that is created represents the shortest path on the sphere. As an example of the difference between the "shortest line" (planar calculation) and the geodesic line, we can explore the difference between a line connected between Los Angeles and London. In this example, the geodesic line (orange) appears curved and longer in this projection, but it is actually the shortest line!

 

# geodesic line (in orange) vs. "shortest line" (planar, in blue)
# the geodesic line appears curved and longer in this projection, but it is actually the shortest path!

# create points for two locations (note that London's longitude is negative: 0.1276 degrees W)
data = [
    ("POINT (-118.2426 34.0549)", "Los Angeles", 
     "POINT (-0.1276 51.5072)", "London")
]

# create a DataFrame with the two locations
# convert the locations from well-known text to geometry
df_lines = spark.createDataFrame(data, ["wkt-origin", "city-origin", "wkt-dest", "city-dest"])\
          .select("city-origin", 
                  "city-dest", 
                  ST.geom_from_text("wkt-origin", 4326).alias("geometry-origin"), 
                  ST.geom_from_text("wkt-dest", 4326).alias("geometry-dest"))


# plot the results to compare the two lines generated (geodesic vs. shortest)
myplt = df_lines.select(ST.geodesic_shortest_line("geometry-origin", "geometry-dest")).st.plot(basemap="light", figsize=(10,10), color="orange")
df_lines.select(ST.shortest_line("geometry-origin", "geometry-dest")).st.plot(ax=myplt);

 

[Image: map comparing the geodesic line (orange) and planar shortest line (blue) between Los Angeles and London]

Distance and Length calculations

Now let's look at distance and length for these lines...

  • ST_Distance and ST_GeodesicDistance will calculate the distance between two input locations. The unit for ST_Distance is the same as the input geometries. For ST_GeodesicDistance the distance is calculated in meters.
  • ST_Length and ST_GeodesicLength will calculate the length of an input feature. The unit for ST_Length is the same as the input geometry. For ST_GeodesicLength the length is calculated in meters.
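
To make the geodesic idea concrete, here is a rough pure-Python haversine calculation (independent of the GeoAnalytics library) of the great-circle distance between two of the schools from our earlier DataFrame. ST_GeodesicDistance uses a more accurate ellipsoidal model, so its results will differ slightly:

```python
import math

def haversine_m(lon1, lat1, lon2, lat2, radius_m=6_371_000):
    """Great-circle distance in meters on a spherical earth."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    a = math.sin((p2 - p1) / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

# Hurley K-8 and Higginson Elementary School from the schools DataFrame
d = haversine_m(-71.0777, 42.3388, -71.0919, 42.3176)
print(f"{d:.0f} m")  # roughly 2.6 km between the two schools
```
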

We will take a look at these calculations using the lines that we generated earlier connecting school locations to the service requests that are nearby.

With the distance calculation, we can also look at a particularly nice feature of the GeoAnalytics library - projection on the fly. If you want to calculate a property or relationship using two geometries that are in different coordinate systems, the library can often reconcile this difference for you so that you don't need to transform the data.

  • The geometry (service requests) column is in the Massachusetts State Plane coordinate system (SRID 2249)
  • The geometry_school (school locations) column is in WGS84 latitude and longitude (SRID 4326)

Let's use these two geometries to perform a variety of distance calculations and see how GeoAnalytics can work with geometries in different coordinate systems! In the cell below we are calculating:

  • Distance between the two geometries without transforming the coordinates
  • Distance between the two geometries with a transformation to convert the geometry_school to Massachusetts State Plane (2249)
  • Geodesic distance between the two geometries without a transformation

We will also calculate the length of the line geometry_line that we created earlier to connect the two point locations.  We'll perform these calculations and display the results as an example:

 

display(
    df_schools_requests.select("geometry", 
                            "geometry_school", 
                            F.round(ST.distance("geometry", "geometry_school"), 2).alias("Distance (no transform)"),
                            F.round(ST.distance("geometry", ST.transform("geometry_school", 2249)),2).alias("Distance (transform)"),
                            F.round(ST.geodesic_distance("geometry", "geometry_school"),2).alias("GeodesicDistance (meters)"),
                            "geometry_line",
                            F.round(ST.length("geometry_line"),2).alias("Length (feet)"),
                            F.round(ST.geodesic_length("geometry_line"),2).alias("GeodesicLength (meters)")
                            )
)

 

[Image: table of the distance and length calculations for each school/service request pair]

This shows how you can perform these calculations, even with geometries in different coordinate systems!  GeoAnalytics for Fabric is great at helping mitigate differences in coordinate systems wherever possible.

Now let's take one last look at some of the properties we can calculate and explore area.

ST_Area and ST_GeodesicArea

ST_Area - calculates a planar area for each geometry. The unit of the area is the same as the input geometries. If you calculate area on a dataset in a geographic coordinate system, the units will be square decimal degrees. This is not a recommended unit. For using the ST_Area function it is best to use a projected coordinate system. More information is available in the Coordinate Systems and Transformations core concept document.

ST_GeodesicArea calculates area for each geometry in square meters, and requires that a spatial reference is set on the input geometry column.

For this example, we will calculate areas for US states that we read in from an ArcGIS feature service.

 

# US states feature service from the Esri Living Atlas of the World
# For more information on this dataset: https://www.arcgis.com/home/item.html?id=774019f31f8549c39b5c72f149bbe74e
fs_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Census_States/FeatureServer/0"

df_states = spark.read.format('feature-service').load(fs_url)

 

Before calculating the area, we should check the coordinate system of the data so that we know more about the units that are used in the coordinate system and decide if we are going to use ST_Area or ST_GeodesicArea. We can do this with the get_spatial_reference() function:

 

df_states.st.get_spatial_reference()

 

This returns information showing that the data has an SRID of 4326 (the WGS84 geographic coordinate system):

SpatialReference(srid=4326, is_projected=False, unit='Degree', wkt='GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]')

For this dataset, let's look at the values calculated by ST_Area and ST_GeodesicArea.

In the example below, we display the original data for the state name and square miles, plus three new calculated fields, all rounded to two decimal places for readability using the PySpark round() function:

  • area using ST_Area - since the spatial reference is 4326 with a unit of Degree, the results will be in square decimal degrees, which has no real applicable value
  • area using ST_Area on a transformed geometry using USA Contiguous Albers Equal Area projection (102003). This calculation returns square meters, so they are converted to square miles in the example below
  • area using ST_GeodesicArea on the original geometry. This calculation returns square meters, so they are converted to square miles in the example below
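
The 2.59e+6 divisor in the next cell is the number of square meters in a square mile; a quick plain-Python check of that constant:

```python
# one international mile is exactly 1609.344 meters
METERS_PER_MILE = 1609.344

sq_m_per_sq_mile = METERS_PER_MILE ** 2
print(f"{sq_m_per_sq_mile:,.0f} sq m per sq mile")  # 2,589,988
# so dividing square meters by 2.59e+6 yields (approximate) square miles
```
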

The values returned are all very close to one another in value, but not exactly the same. These deviations are due to distortion in calculation of area for large regions (the Coordinate Systems and Transformations core concept document provides more information on distortion in projections), as well as minor variations in precision that can be introduced with the projection and transformation process.

 

# When reading feature services, the geometry is generally in a column named "Shape."  This can be confirmed by looking at the displayed table above
display(
    df_states.select("STATE_NAME", 
                     "SQMI", 
                     F.round(ST.area("Shape"), 2).alias("Area (sq decimal degrees)"), 
                     F.round(ST.area(ST.transform("Shape", 102003))/2.59e+6, 2).alias("Area (sqmi)"), 
                     F.round(ST.geodesic_area("Shape")/2.59e+6, 2).alias("GeodesicArea (sqmi)")
                     )
)

 

[Image: table of state areas comparing ST_Area and ST_GeodesicArea results]

Conclusion

Hopefully this quick primer on creating new geometries and calculating their properties has been helpful.  We'll be posting additional content to help you get started, so check back on the Community for more!  And please let us know what type of things you'd like to learn more about!
