
Calculating vessel density and speed using the new GeoAnalytics Engine Track functions

DerekGourley
Esri Contributor

Introduction

Global Positioning System (GPS) data is often challenging to work with due to data volume and quality issues. For example, imagine a fleet of vehicles generating location data every second of every day. Until that data is cleaned up, inaccurate locations keep accumulating, and so do the errors in any analysis built on them. With the release of ArcGIS GeoAnalytics Engine 1.4, 18 new track functions were added that make cleaning and analyzing GPS data easier.

For the fleet-of-vehicles example above, or any large-volume GPS data, the new GeoAnalytics Engine track functions can be used to clean up inaccurate GPS locations, calculate movement speed, and more to get the maximum value from your data. GeoAnalytics Engine functions and tools are built to work with big data, so the new track functions can clean up and analyze GPS data in daily batches as it arrives, or across years of data at a time.

In a previous blog post, we used GeoAnalytics Engine to explore problems such as generating network paths and determining shipping lane centerlines from Automatic Identification System (AIS) data. In this post, we’ll demonstrate the more advanced types of analyses that can be done with the AIS data by using a few of the new GeoAnalytics Engine track functions: TRK_Aggr_CreateTrack, TRK_SplitByDistanceGap, TRK_Duration, and TRK_Speed.

combined_3.png

AIS data

AIS vessel data is collected by the U.S. Coast Guard and is available as a collection of comma-separated value (CSV) files for download from the MarineCadastre.gov website. Each CSV file contains the vessel observations for a single day and includes information such as a unique vessel identifier, longitude, latitude, vessel name, vessel type, and the time of each observation. GeoAnalytics Engine can read in and combine all the AIS data from these CSV files into a single DataFrame, which reduces the time and effort needed to set up the AIS data for the analyses.

GeoAnalytics Engine can read in and preprocess the AIS data for the analyses using the following steps:

  • Import the GeoAnalytics Engine and PySpark functions
  • Create a DataFrame from the January 2023 AIS data
    • Create a point geometry column
    • Create a timestamp column
    • Select only the columns needed for the analyses
  • Filter the DataFrame to include vessels with a classification of "cargo" or "tanker"
  • Transform the point geometry column to a projected coordinate system to ensure that the spatial bins that are created all have the same area
  • Limit the extent of the data to part of the Pacific Ocean and the maritime boundary between the United States and Canada

 

 

# Import the modules needed for the analysis
import pyspark.sql.functions as F
from geoanalytics.sql import functions as ST  # GeoAnalytics Engine spatial type functions 
from geoanalytics.tracks import functions as TRK  # GeoAnalytics Engine track functions 

# Create a Spark DataFrame from the January 2023 AIS data
ais_points = spark.read.option("header", True).csv(r"C:\...\ais_2023\january\*") \
                  .withColumn("geometry", ST.point("lon", "lat", 4269)) \
                  .withColumn("timestamp", F.to_timestamp("BaseDateTime")) \
                  .select("mmsi", "vesselname", "vesseltype", "geometry", "timestamp")

# Select only the vessels that have a classification of "cargo" (vessel types 70-79) or "tanker" (vessel types 80-89)
ais_points = ais_points.where("vesseltype between 70 and 89")

# Transform the geometry to a projected coordinate system (54034: World Cylindrical Equal Area)
ais_points = ais_points.withColumn("geometry", ST.transform("geometry", 54034))

# Select the AIS data within the analysis extent of the Pacific Ocean and the maritime boundary between the United States and Canada
analysis_extent = (-14049143.490524316, 4642774.9528538855, -13563244.751623178, 4869331.901541563)
ais_points = ais_points.where(ST.bbox_intersects("geometry", *analysis_extent)) \
                       .persist()

# View 5 DataFrame rows and the total row count 
ais_points.show(5)
ais_points.count()
+---------+--------------+----------+--------------------+-------------------+
|     mmsi|    vesselname|vesseltype|            geometry|          timestamp|
+---------+--------------+----------+--------------------+-------------------+
|352001262|  BOHOL ISLAND|        70|{"x":-1.371275180...|2023-01-01 00:00:01|
|477014600|SEASPAN DALIAN|        70|{"x":-1.370911722...|2023-01-01 00:00:06|
|477112900| ORIENTAL KING|        80|{"x":-1.366506252...|2023-01-01 00:00:04|
|440057000|     OCEAN JIN|        70|{"x":-1.368320650...|2023-01-01 00:00:03|
|566407000|  APL LE HAVRE|        70|{"x":-1.371341526...|2023-01-01 00:00:03|
+---------+--------------+----------+--------------------+-------------------+
only showing top 5 rows

2415980

 

 

This post uses the AIS data for the month of January 2023, which contains ~242 million points before the preprocessing steps and ~2.4 million points after. To evaluate performance, the analyses from this post were also run using the full 2023 AIS dataset, which contains ~3 billion points. The performance section at the end of this post goes into more detail about how these analyses scale when working with the full 2023 AIS dataset.

The AIS data points from the DataFrame created above can be visualized using the Engine's built-in plotting functionality.

 

 

# Plot styles
from matplotlib import pyplot as plt

# Disable the plot axis tickmarks
plt.rcParams['xtick.bottom'] = False
plt.rcParams['xtick.labelbottom'] = False
plt.rcParams['ytick.left'] = False
plt.rcParams['ytick.labelleft'] = False

# Define a plot style for the AIS data
ais_plot_style = dict(figsize=(17,12), basemap="light", sr=54034, extent=analysis_extent, max_geoms=2_500_000)

# Plot the AIS points
ais_points.st.plot(**ais_plot_style, s=1);

 

 

ais_points.png

Creating tracks

In GeoAnalytics Engine, a track is a linestring that represents the change in an entity's location over time. Tracks are created from a point geometry column and a timestamp column: the linestring vertices are ordered by increasing timestamp, and each vertex stores its timestamp as an m-value. A visual example of a track is shown in the image below.
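The idea can be illustrated in plain Python (this is a conceptual sketch only, not the Engine API): an entity's observations are sorted by time, and each timestamp becomes the m-value of the corresponding vertex.

```python
from datetime import datetime

# Unordered observations for one entity: (x, y, timestamp)
observations = [
    (-13700000.0, 4815000.0, datetime(2023, 1, 1, 0, 10)),
    (-13701000.0, 4815500.0, datetime(2023, 1, 1, 0, 0)),
    (-13699000.0, 4814500.0, datetime(2023, 1, 1, 0, 20)),
]

# Order the vertices by increasing timestamp, as track creation does
track = sorted(observations, key=lambda obs: obs[2])

# Each vertex is (x, y, m); the m-values are now monotonically increasing
m_values = [vertex[2] for vertex in track]
print([m.strftime("%H:%M") for m in m_values])  # ['00:00', '00:10', '00:20']
```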

uc_2024_demo_theater_graphics_final_cropped.png

The TRK_Aggr_CreateTrack function creates a track for each vessel after the AIS data has been grouped by the Maritime Mobile Service Identity (MMSI) column, which is a unique identifier for each vessel.

 

 

# Create tracks from the AIS ship points
tracks = ais_points.groupBy("MMSI").agg(TRK.aggr_create_track("geometry", "timestamp").alias("track"))

# Plot the tracks
tracks.st.plot(**ais_plot_style, linewidth=1, alpha=0.2);

 

 

Visualizing the vessel tracks that were created shows that some of them cross over land.

tracks.png

To see what causes these tracks to cross over land, we can explore a single track created from a subset of the data for one vessel that contains inaccurate observations.

 

 

# Create a track covering the path of a single vessel for a day that contains inaccurate data
inaccurate_track = ais_points.where("mmsi==355745000 and timestamp>'2023-01-08 00:00:00'") \
                             .groupBy("mmsi").agg(TRK.aggr_create_track("geometry", "timestamp").alias("track"))

# Plot the inaccurate track
inaccurate_track.st.plot(figsize=(17,12), basemap="light", xmargin=0.25, ymargin=0.25, linewidth=4, color="k");

 

 

invalid_track_1.png

The visualization shows the vessel track beginning in Puget Sound and then heading out to the Pacific Ocean. Toward the end of the track, two observations report the vessel back in Puget Sound even though it is still in the Pacific Ocean. These inaccurate observations cause the track to cross over land.

These inaccurate observations can be explored further using two GeoAnalytics Engine functions. First, the ST_Segments function splits the track into segments that each contain only two points (vertices). The TRK_Length function is then used to calculate the length of each segment in miles.

 

 

# Create segments from the inaccurate track
inaccurate_track_segments = inaccurate_track.withColumn("segments", ST.segments("track", num_points=2, step_size=1))

# Calculate the length of each segment
inaccurate_track_segments = inaccurate_track_segments.select("MMSI", F.explode("segments").alias("segments")) \
                                                     .withColumn("segment_length", F.round(TRK.length(track="segments", output_units="miles"), 2)) \
                                                     .sort("segment_length", ascending=False)

# Plot the inaccurate track
inaccurate_track_plot = inaccurate_track.st.plot(figsize=(17,12), basemap="light", xmargin=0.25, ymargin=0.25, linewidth=4, color="k");

# Plot the 3 longest segments of the inaccurate track
inaccurate_track_segments.where("segment_length >= 12") \
                         .st.plot(ax=inaccurate_track_plot, cmap="hsv", cmap_values="segment_length",
                                  is_categorical=True, xmargin=0.25, ymargin=1, linewidth=4,
                                  legend=True, legend_kwds={"title": "Segment length in miles"});

 

 

Visualizing the three longest segments of the track shows that the inaccurate observations are farther apart than the vessel could reasonably have traveled in the time between observations.

invalid_track_2.png

Exploring the single vessel track with inaccurate observations showed that the inaccurate points were more than 12 miles apart. Based on this, we will use a lower threshold and treat any gap of more than 5 miles between consecutive observations as inaccurate. The right distance value depends on the data and on how frequently observations are recorded across the entire dataset. The TRK_SplitByDistanceGap function is used to split the vessel tracks anytime two consecutive vertices are more than 5 miles apart.

 

 

# Split the tracks based on a distance gap
split_tracks = tracks.withColumn("split_by_distance_gap", TRK.split_by_distance_gap("track", (5, "miles")))
split_tracks = split_tracks.select("MMSI", F.explode("split_by_distance_gap").alias("tracks")) \
                           .persist()

# Plot the split tracks
split_tracks.st.plot(**ais_plot_style, linewidth=1);

 

 

Visualizing the vessel tracks after the inaccurate observations have been cleaned up shows that the tracks no longer cross over land. This is useful not only for cleaning up vessel tracks, but for tracks created from any moving objects with inaccurate distance gaps between observations. For inaccurate time gaps between observations, the TRK_SplitByTimeGap function splits tracks anytime two vertices are farther apart in time than the specified gap duration.
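The time-gap splitting logic can be illustrated in plain Python (a conceptual sketch, not the Engine API): a time-ordered list of vertices is cut wherever consecutive timestamps are farther apart than a maximum gap.

```python
from datetime import datetime, timedelta

def split_by_time_gap(points, max_gap):
    """Split a time-ordered list of (x, y, timestamp) vertices into
    sub-tracks whenever consecutive timestamps differ by more than max_gap.
    Assumes at least one point."""
    tracks, current = [], [points[0]]
    for prev, curr in zip(points, points[1:]):
        if curr[2] - prev[2] > max_gap:
            tracks.append(current)  # close the current sub-track at the gap
            current = []
        current.append(curr)
    tracks.append(current)
    return tracks

points = [
    (0.0, 0.0, datetime(2023, 1, 1, 0, 0)),
    (1.0, 0.0, datetime(2023, 1, 1, 0, 5)),
    (9.0, 9.0, datetime(2023, 1, 1, 3, 0)),  # large time gap before this point
    (9.5, 9.0, datetime(2023, 1, 1, 3, 5)),
]

sub_tracks = split_by_time_gap(points, timedelta(minutes=15))
print(len(sub_tracks))  # 2
```

The distance-gap variant used above works the same way, except the cut condition compares the distance between consecutive vertices instead of the elapsed time.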

tracks_cleaned_up.png

Vessel density

Vessel density maps provide a high-level summary of vessel locations and help identify trends such as the areas with the greatest number of vessels and the areas where vessels spend the most time. One important question with movement data is how long objects spend inside certain areas. For example, a vessel density map can be used to identify congested ports where vessels must wait before unloading their cargo. To create a vessel density map with GeoAnalytics Engine, the first step is to use the ST_HexBins function to create hexagonal bins that cover the vessel tracks. In this case, based on the extent of the data, 5,000-meter (5 km) bins were used. The ST_Intersection function is then used with the hexagonal bins to split the vessel tracks into segments for each bin.

 

 

# Split the tracks using hex bins
track_segments = split_tracks.withColumn("bin", F.explode(ST.hex_bins("tracks", bin_size=5000))) \
                             .withColumn("intersection", ST.intersection("tracks", ST.bin_geometry("bin"))) \
                             .withColumn("track_segments", F.explode(ST.geometries("intersection"))) \
                             .drop("tracks", "intersection") \
                             .persist()

# Plot the split tracks and the bins
split_tracks_plot = track_segments.st.plot(**ais_plot_style, linewidth=1)
track_segments.select(ST.bin_geometry("bin")).st.plot(ax=split_tracks_plot, edgecolor="crimson", facecolor="none");

 

 

tracks_hex_bins.png

One method of determining vessel density is to count the number of tracks within each bin.  Visualizing the count of vessel tracks shows the areas with the highest vessel traffic.

 

 

# Calculate the total vessel count for each bin
vessel_count = track_segments.groupBy("bin").count() \
                             .withColumn("bin_geometry", ST.bin_geometry("bin"))

# Define a plot style for the binned AIS data
ais_plot_style_bins = dict(figsize=(26,10), basemap="light", sr=54034, extent=analysis_extent, max_geoms=2_500_000)

# Plot the total vessel count for each bin
vessel_count.st.plot(**ais_plot_style_bins, cmap="plasma", cmap_values="count", legend=True,
                     legend_kwds={"label": "Vessel Count"});

 

 

vessel_traffic_bins.png

Another way of determining vessel density is to calculate the time spent by all vessels in each bin. The TRK_Duration function is used to calculate the duration of each track segment. When visualized, the total time spent by vessels in each bin shows a very different pattern than the count of vessels per bin. From the visualization, vessels appear to spend most of their time in locations such as the Vancouver Harbor, as indicated by the single yellow bin. This could be due to vessels needing to anchor and wait for an open spot at the dock before unloading their cargo.

 

 

# Calculate the duration of each track segment
track_segments_duration = track_segments.withColumn("track_duration", TRK.duration("track_segments", output_units="hours"))

# Calculate the total duration for all vessels in each bin
vessel_duration = track_segments_duration.groupBy("bin").sum("track_duration") \
                                         .withColumnRenamed("sum(track_duration)", "total_duration") \
                                         .withColumn("bin_geometry", ST.bin_geometry("bin"))
# Plot the total vessel duration
vessel_duration.st.plot(**ais_plot_style_bins, cmap="plasma", cmap_values="total_duration", legend=True,
                        legend_kwds={"label": "Vessel duration in hours"});

 

 

vessel_duration_bin.png

Vessel speed

Tracking vessel speed is important for reducing underwater noise that affects whales and for safety, especially in high-traffic areas where vessels dwell before going into port. The Vancouver Harbor AIS point data is selected because the vessel density visualization showed that vessels spent the most time there.

 

 

# Extent for the Vancouver Harbor
vancouver_harbor_extent = (-13707985.924030624, 4814096.888325807, -13694991.629222166, 4816838.044237577)

# Select the AIS data within the Vancouver Harbor
ais_points_vh = ais_points.where(ST.bbox_intersects("geometry", *vancouver_harbor_extent)) \
                          .persist()

# Define a plot style for the Vancouver Harbor data
vh_plot_style = dict(figsize=(17,8), basemap="light", sr=54034, aspect='auto', extent=vancouver_harbor_extent)

# Plot the data
ais_points_vh.st.plot(**vh_plot_style, s=1);

 

 

vh_points.png

The visualization of the Vancouver Harbor AIS data shows swirls of points, which appear to be areas where ships anchor. The GeoAnalytics Engine Find Dwell Locations tool is used to detect dwell points. A maximum distance of 500 meters and a minimum duration of 2 hours are used to identify vessels that remain relatively stationary for a meaningful amount of time.
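The dwell criterion can be sketched in plain Python. This is a simplified illustration of the idea only, not the Find Dwell Locations algorithm: a run of observations counts as a dwell if every point stays within the maximum distance of an anchor point and the observations span at least the minimum duration.

```python
from datetime import datetime, timedelta
from math import hypot

def is_dwell(points, max_distance, min_duration):
    """Return True if all (x, y, timestamp) points stay within max_distance
    of the first point and the observations span at least min_duration."""
    anchor = points[0]
    within = all(hypot(p[0] - anchor[0], p[1] - anchor[1]) <= max_distance
                 for p in points)
    span = points[-1][2] - points[0][2]
    return within and span >= min_duration

# A vessel drifting near its anchor point for 2.5 hours (coordinates in meters)
anchored_vessel = [
    (0.0, 0.0, datetime(2023, 1, 1, 0, 0)),
    (120.0, 300.0, datetime(2023, 1, 1, 1, 30)),
    (250.0, 100.0, datetime(2023, 1, 1, 2, 30)),
]
print(is_dwell(anchored_vessel, max_distance=500,
               min_duration=timedelta(hours=2)))  # True
```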

 

 

# Import the Find Dwell Locations tool
from geoanalytics.tools import FindDwellLocations

# Run the Find Dwell Locations tool
find_dwells_vh = FindDwellLocations() \
                    .setTrackFields("mmsi") \
                    .setDwellMaxDistance(500, "meters") \
                    .setDwellMinDuration(2, "hours") \
                    .setOutputType("AllPoints") \
                    .run(ais_points_vh) \
                    .persist()

# Plot the dwell points in red
dwells_vh_plot = find_dwells_vh.where("DwellID is not NULL").st.plot(**vh_plot_style, s=1, c="r", zorder=2, label="Dwell points")
find_dwells_vh.where("DwellID is NULL").st.plot(**vh_plot_style, ax=dwells_vh_plot, s=1, zorder=1).legend(fontsize=15, markerscale=10);

 

 

vh_dwells.png

Having identified the dwell locations, we remove those points, create a track for each vessel, and clean the tracks up using the same 5-mile distance gap between observations that was used previously. The resulting tracks, created without the dwell locations, can now be used to calculate vessel speeds.

 

 

# Select the non-dwell AIS locations in the Vancouver Harbor
non_dwells_vh = find_dwells_vh.where("DwellID is NULL")

# Create tracks from the non-dwell locations and split the tracks using a distance gap
non_dwell_tracks_vh = non_dwells_vh.groupBy("MMSI").agg(TRK.aggr_create_track("geometry", "timestamp").alias("track")) \
                                   .withColumn("split_by_distance_gap", TRK.split_by_distance_gap("track", (5, "miles"))) \
                                   .drop("geometry") \
                                   .persist()

# Select the tracks after they have been split using a distance gap
non_dwell_split_tracks_vh = non_dwell_tracks_vh.select("MMSI", F.explode("split_by_distance_gap").alias("tracks")) \
                                               .persist()

# Plot the split tracks
non_dwell_split_tracks_vh.st.plot(**vh_plot_style, linewidth=1);

 

 

vh_tracks.png

Using a reasonably small bin size for the area, in this case 50-meter hexagonal bins, the tracks are again split into segments. With the tracks split into segments inside each bin, the speed of the vessels can then be summarized per bin.

 

 

# Split the tracks using hex bins
non_dwell_track_segments_vh = non_dwell_split_tracks_vh.withColumn("bin", F.explode(ST.hex_bins("tracks", bin_size=50, padding=50))) \
                                                       .withColumn("intersection", ST.intersection("tracks", ST.bin_geometry("bin"))) \
                                                       .withColumn("track_segments", F.explode(ST.geometries("intersection"))) \
                                                       .drop("tracks", "intersection")

# Plot the split tracks and the bins
non_dwell_split_tracks_plot = non_dwell_track_segments_vh.st.plot(**vh_plot_style, linewidth=1)
non_dwell_track_segments_vh.select(ST.bin_geometry("bin")).st.plot(**vh_plot_style, ax=non_dwell_split_tracks_plot, edgecolor="crimson", facecolor="none");

 

 

vh_tracks_hex_bins.png

The GeoAnalytics Engine TRK_Speed function is used to calculate the speed for each track segment. Summary statistics can then be calculated to get an idea of the vessels’ speeds in the Vancouver Harbor.

 

 

# Calculate the speed for each track segment
track_segments_speed = non_dwell_track_segments_vh.withColumn("track_speed", TRK.speed("track_segments", output_units="NauticalMilesPerHour"))

# View the track segment speed summary statistics
track_segments_speed.describe("track_speed").show()
+-------+--------------------+
|summary|         track_speed|
+-------+--------------------+
|  count|               87053|
|   mean|   9.141383254544339|
| stddev|  3.7138950839277434|
|    min|7.785287402771251E-5|
|    max|  27.344280966239353|
+-------+--------------------+

 

 

The average speed of the track segments within each bin can be calculated to visualize how average vessel speed changes across the Vancouver Harbor.

 

 

# Calculate the average speed for all vessels in each bin
vessel_speed = track_segments_speed.groupBy("bin").mean("track_speed") \
                                   .withColumnRenamed("avg(track_speed)", "average_speed") \
                                   .withColumn("bin_geometry", ST.bin_geometry("bin")) \
                                   .persist()

# Define a plot style for the Vancouver Harbor vessel speed data
vh_plot_style_bins = dict(figsize=(26,10), basemap="light", sr=54034, aspect='auto', extent=vancouver_harbor_extent)

# Plot the average vessel speed for each bin
vessel_speed.st.plot(**vh_plot_style_bins, cmap="plasma", cmap_values="average_speed", legend=True,
                     legend_kwds={"label": "Average vessel speed in knots"});

 

 

vh_speeds.png

The visualization of the vessel speeds within the Vancouver Harbor shows vessels moving fastest when traversing the center of the harbor. Vessel speeds tend to decrease as vessels approach the edges of the harbor, where they tend to anchor.

Performance

Working with large volume GPS data is computationally intensive. As the data size grows, it can quickly exceed what can be done with traditional desktop computing resources.

|  | Local (AIS data for January 2023) | Microsoft Fabric (AIS data for all of 2023) |
|---|---|---|
| Starting data size | ~242 million points | ~3 billion points |
| Data size after preprocessing | ~2.4 million points | ~24 million points |
| Time to read and preprocess the data | ~3.5 minutes | ~7 minutes |
| Time to complete the analyses | ~4.5 minutes | ~6.5 minutes |
| Total time | ~8 minutes | ~13.5 minutes |

 

For the smaller dataset covering January 2023, with ~242 million points, the workflow completed in ~8 minutes on a desktop computer with 4 cores and 16 gigabytes of RAM. Breaking this down, it took ~3.5 minutes to read in and preprocess the ~242 million AIS points and ~4.5 minutes to complete the analyses on the resulting ~2.4 million points.

GeoAnalytics Engine can run all of this locally, which is great for prototyping, but it was designed to scale to process large datasets in the cloud. The same workflow completed on the full 2023 AIS dataset, with ~3 billion points, in ~13.5 minutes using Microsoft Fabric (runtime 1.2) with the default compute settings: a Spark driver with 8 cores and 56 gigabytes of RAM and a dynamic allocation of 1 to 9 Spark executors, each with 8 cores and 56 gigabytes of RAM. Breaking this down, it took ~7 minutes to read in and preprocess the ~3 billion 2023 AIS points from an Azure Data Lake and ~6.5 minutes to complete the analyses on the ~24 million points remaining after preprocessing.

This post focused on vessel data, but the track functions that were added in GeoAnalytics Engine 1.4 can be applied to any objects that have their positions periodically recorded over time. We would love to hear about how these new track functions can help with your analysis workflows, or if you want to know more about the other GeoAnalytics Engine functions and tools. Please feel free to provide feedback or ask questions in the comment section below.

About the Author
I am a product engineer on the GeoAnalytics team.