To help kick off the Public Preview of ArcGIS GeoAnalytics for Microsoft Fabric, we have a series of blog posts exploring the functionality in GeoAnalytics for Fabric. In this post, we will look at analysis capabilities for working with large point datasets to calculate trends and hot spots.
We'll start with a refresher on how to import GeoAnalytics for Fabric and start working with it. To import GeoAnalytics for Fabric you just need to add import geoanalytics_fabric to a cell in your notebook. For more information on enabling the library, see the Getting started documentation. There are also examples in the earlier blog posts in our introductory series.
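For example, a minimal setup cell might look like this (the spatial SQL functions module path under the ST alias is our assumption; see the Getting started documentation for the exact imports):
# load the GeoAnalytics for Fabric library
import geoanalytics_fabric
# import the spatial SQL functions under a conventional ST alias (module path assumed)
import geoanalytics_fabric.sql.functions as ST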
For this post, we will work with a public safety dataset for the city of Boston, MA, from the Azure Open Datasets collection. This dataset contains latitude and longitude coordinates for public safety service requests in Boston.
# https://learn.microsoft.com/en-us/azure/open-datasets/dataset-boston-safety?tabs=pyspark
# Azure storage access info
blob_account_name = "azureopendatastorage"
blob_container_name = "citydatacontainer"
blob_relative_path = "Safety/Release/city=Boston"
blob_sas_token = r""
# Allow SPARK to read from Blob remotely
wasbs_path = 'wasbs://%s@%s.blob.core.windows.net/%s' % (blob_container_name, blob_account_name, blob_relative_path)
spark.conf.set(
    'fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name, blob_account_name),
    blob_sas_token)
print('Remote blob path: ' + wasbs_path)
# read the parquet data with Spark and
# use the GeoAnalytics for Fabric ST_Point function to create a point geometry column
import geoanalytics_fabric.sql.functions as ST  # module path assumed for the ST functions

df = spark.read.parquet(wasbs_path)\
    .withColumn("geometry", ST.point("longitude", "latitude", 4326))
df.persist().count()
This dataset is fairly large - more than 24 million records! With datasets this large, it is often difficult to identify patterns just by looking at the raw points on a map.
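To see this for yourself, you can plot the raw points with the same st.plot accessor used later in this post (with more than 24 million records, you may want to sample the DataFrame first):
# plot the raw service request points
# (this can be slow on 24+ million records; consider df.sample(0.01) to speed it up)
df.st.plot(basemap="dark", figsize=(10, 10))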
GeoAnalytics for Fabric can make these patterns easier to explore through aggregation and the other analytic techniques covered in this post. Not only does aggregation make patterns easier to see within your notebook experience, it is also often an important step in preparing data for exploration in common business intelligence tools like Power BI.
Let's start with aggregation.
It is common with point datasets to have so much data that the points on the map overlap and hide patterns in the distribution. When the points overlap, you cannot visually identify the density - if the whole map is covered with data markers, there is no pattern to see other than a fully covered map.
Spatial aggregation can be useful in these cases to make it easier to identify general spatial patterns in point datasets. Spatial aggregation allows you to count up the number of points that fall into discrete geographic regions (e.g., squares, hexagons, school districts, census tracts, etc.) and then map the counts of points, or other aggregates of the attributes, into each of these polygons.
In addition to revealing spatial patterns, aggregating the data also creates smaller datasets that BI tools like Power BI can visualize more easily in dashboards.
As we saw in the plot of our dataset above, there really is too much data to clearly see a pattern. Even if we zoom in fairly closely, we still might have trouble seeing patterns in this dense dataset. So, let's explore how spatial binning and aggregation can help.
To help make sense of the distribution, we will aggregate using the Aggregate Points tool in GeoAnalytics for Fabric. Aggregate Points summarizes points into polygons from other datasets, or bins (square, hexagonal, or H3). The boundaries from the polygons or bins are used to collect the points within each area and use them to calculate statistics. The result always contains the count of points within each area.
Note that analysis with hexagonal or square bins requires that your input geometry has a projected coordinate system. For analysis with H3 bins, it is expected that your input geometry is in the World Geodetic System 1984 coordinate system (SRID 4326). More information about projections and coordinate systems can be found in the core concept on Coordinate systems and transformations.
If your data is not in an appropriate coordinate system, the Aggregate Points tool will transform your input to the required coordinate system, or you can transform your data yourself using ST_Transform.
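For example, a minimal sketch of transforming the data yourself might look like this (the transform function name and its SRID argument are assumptions based on the ST_Transform function mentioned above; 3857 is Web Mercator, a common projected coordinate system):
import geoanalytics_fabric.sql.functions as ST  # module path assumed

# transform the WGS84 (4326) point geometries to a projected coordinate system
# before square or hexagonal binning (function name and arguments assumed)
df_projected = df.withColumn("geometry", ST.transform("geometry", 3857))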
With that background out of the way, let's create some square bins. The example below uses the Aggregate Points tool to divide our data into square spatial bins with a side length of 0.1 miles.
# bin the data
from geoanalytics_fabric.tools import AggregatePoints
# aggregate points into square bins
result_agg = AggregatePoints() \
    .setBins(bin_size=0.1, bin_size_unit="Miles", bin_type="Square") \
    .run(df)
This results in a new DataFrame with bin geometries and counts for the features inside the bin. You can also add in other aggregated summary fields when you create the bins (e.g., minimum value, maximum value, sum, etc.).
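For example, a sketch of adding a summary statistic might look like this (the addSummaryField method is an assumption based on the tool's summary options, and response_minutes is a hypothetical numeric column used purely for illustration):
# aggregate into bins and also compute the mean of a numeric column in each bin
# ("response_minutes" is a hypothetical column; addSummaryField is assumed)
result_agg_stats = AggregatePoints() \
    .setBins(bin_size=0.1, bin_size_unit="Miles", bin_type="Square") \
    .addSummaryField(summary_field="response_minutes", statistic="Mean") \
    .run(df)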
Plotting the results of our aggregation shows a new pattern where we can more easily see and understand the distribution of our data.
Maps comparing the original dataset and the data binned into 0.1 mile square bins
When aggregating data using the Aggregate Points tool, you can also slice the data based on times instead of binning all of the data together. This allows for easier comparisons over time to look at patterns of change.
Binning with a time step requires that time is enabled on the input DataFrame (e.g., you have a timestamp column).
By default, time slicing is aligned to January 1, 1970, and steps are calculated from that point. If you want the time steps to begin at a different point, you can specify a reference time. A reference time can be a date (e.g., January 1, 2016) or a date and time (e.g., January 1, 2016, at 9:30 a.m.). More details on time stepping can be found in the core concept documentation for time stepping.
Here is an example of creating a separate set of bins for each year in the input data. This will allow us to look at the individual years or change over time (e.g., absolute or percent increase / decrease over the past year).
# aggregate points into square bins - for every 1 year in the data. By default the yearly interval starts at January 1
# this can be adjusted with the reference_time parameter input
result_agg_yearly = AggregatePoints() \
    .setBins(bin_size=0.1, bin_size_unit="Miles", bin_type="Square") \
    .setTimeStep(interval_duration=1, interval_unit="years") \
    .run(df)
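If you need the yearly steps to start somewhere other than January 1, a sketch using the reference_time parameter might look like this (the exact value type that setTimeStep accepts for reference_time is an assumption; check the tool reference):
# start the yearly time steps at a specific reference date instead of the
# January 1, 1970 default (passing a datetime to reference_time is an assumption)
from datetime import datetime

result_agg_shifted = AggregatePoints() \
    .setBins(bin_size=0.1, bin_size_unit="Miles", bin_type="Square") \
    .setTimeStep(interval_duration=1, interval_unit="years",
                 reference_time=datetime(2016, 7, 1)) \
    .run(df)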
With result_agg_yearly, we can look at individual years and make comparisons. To specify a year when performing analysis or plotting, we filter using the step_start or step_end columns.
# plot results from aggregating points into square bins for _one_ year (2013)
from pyspark.sql import functions as F

result_plot = result_agg_yearly\
    .filter(F.year("step_start") == 2013)\
    .st.plot(cmap_values="COUNT",
             basemap="dark",
             cmap="YlGnBu",
             vmax=1000,
             figsize=(10,10))
Here is an example of two of the years in the result:
Comparing binned data for two different years
We can also use these results to calculate change over time between years.
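A sketch of that calculation might look like this (COUNT comes from the aggregation above, while the geometry column name and the as_text conversion used to build a join key are assumptions; any stable bin identifier would work):
# compare bin counts between two years to calculate absolute and percent change
# (ST.as_text is assumed as a way to build a join key from the bin geometry)
from pyspark.sql import functions as F
import geoanalytics_fabric.sql.functions as ST

counts_2013 = result_agg_yearly.filter(F.year("step_start") == 2013) \
    .select(ST.as_text("geometry").alias("bin_wkt"), F.col("COUNT").alias("count_2013"))
counts_2014 = result_agg_yearly.filter(F.year("step_start") == 2014) \
    .select(ST.as_text("geometry").alias("bin_wkt"), F.col("COUNT").alias("count_2014"))

# join the two years on the bin key and compute absolute and percent change
change = counts_2013.join(counts_2014, on="bin_wkt") \
    .withColumn("abs_change", F.col("count_2014") - F.col("count_2013")) \
    .withColumn("pct_change",
                F.round(100 * (F.col("count_2014") - F.col("count_2013")) / F.col("count_2013"), 1))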
Another way that we can explore patterns is through finding statistically significant hotspots. The Find Hot Spots tool identifies statistically significant spatial clusters of many records (hot spots) and few records (cold spots).
The Find Hot Spots tool calculates the Getis-Ord Gi* statistic (pronounced g-i-star) for each feature in a dataset. The resultant z-scores and p-values tell you where features with either high or low values cluster spatially. This tool works by looking at each feature within the context of neighboring features. A feature with a high value is interesting but may not be a statistically significant hot spot. To be a statistically significant hot spot, a feature will have a high value and be surrounded by other features with high values as well.
Find Hot Spots starts by dividing the input point dataset into bins. To use this tool, you must define both a bin size and a neighborhood size to use in evaluating whether or not a location is a "hot spot" relative to its neighborhood.
# import the FindHotSpots tool
from geoanalytics_fabric.tools import FindHotSpots
# use FindHotSpots to evaluate the data using bins of 0.1 mile size, and compare to a neighborhood of 0.5 mile around each bin
result_service_calls = FindHotSpots() \
    .setBins(bin_size=0.1, bin_size_unit="Miles") \
    .setNeighborhood(distance=0.5, distance_unit="Miles") \
    .run(dataframe=df)
The results can be plotted and examined to see the areas where there are significantly more or fewer calls than the surroundings.
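For example, a sketch of plotting the result might look like this (the z-score column name is an assumption; inspect result_service_calls.printSchema() for the actual output columns):
# plot the hot spot bins, colored by the Gi* z-score
# (the "GiZScore" column name is an assumption; check the result schema first)
result_service_calls.st.plot(cmap_values="GiZScore",
                             basemap="dark",
                             cmap="RdBu_r",
                             figsize=(10, 10))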
Hopefully this quick primer on exploring trends and hotspots has been helpful. We'll be posting additional content to help you get started, so check back on the Community for more! And please let us know what type of things you'd like to learn more about!