
A Study of Data Velocity in Cloud Native Vector Data Distribution

BruceHarold
Esri Frequent Contributor

At writing I'm tasked with delivering a demo theatre session at the 2026 Esri User Conference titled Cloud-Native Georelational Data Distribution. If you're coming to UC 2026 you can add it to your schedule. If you can't make it, the book of the play is what follows below; however, to encourage UC attendance, one chapter is missing from the book and I'll show it live - namely how fast you can get data into your map or scene from AWS S3 while enforcing a data model for place-of-interest and building features - because you should always be thinking about an information product and not just data.

The plot has three threads to its data velocity story, the dataset lifecycles I'm categorizing as:

  1. Infrequent periodic bulk replacement
    • Base layer data that isn't time enabled
  2. Frequent append and upsert
    • Operational layers that grow continuously, have life stages, and are time enabled
  3. Medium frequency insert, update and delete edits
    • Operational layers that evolve and are time enabled

My goal is to show how data might be offered in each of the above velocity scenarios, in a cloud native way, and brought into ArcGIS. The common thread is that the cloud native format we'll use is GeoParquet in a public S3-API compliant object store such as AWS. In all cases we'll use ArcPy and DuckDB in notebooks or script tools for data consumption, with the understanding that a data custodian would supply consumers with the tools needed for ArcGIS to consume the data. The combination of S3, GeoParquet and DuckDB provides a performant and functional implementation in ArcGIS.
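Before digging in, here is a minimal sketch of the DuckDB setup the snippets below assume: an in-memory connection with the spatial and httpfs extensions loaded, pointed at the relevant S3 region (the Overture bucket used later is in us-west-2).

import duckdb

# In-memory DuckDB connection; the spatial extension supplies the ST_* functions
# and httpfs supplies S3/HTTP reads for read_parquet over s3:// glob paths.
conn = duckdb.connect()
conn.sql("install spatial")
conn.sql("load spatial")
conn.sql("install httpfs")
conn.sql("load httpfs")

# Public buckets only need a region; private buckets also need credentials
# (see the note near the end of the post).
conn.sql("set s3_region = 'us-west-2'")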

Let's dig in.

 

 

Periodic Bulk Replacement

 

My subject matter data is Overture Maps Foundation Division Area features, a global scale dataset released monthly.  The data model includes polygon geometry with a primary place name and a struct object with alternate names in many languages.  The source data at writing are ten GeoParquet files in AWS S3, with a common schema.  There is no logical partitioning.  The information product I want is a geodatabase feature class, related alternate name table plus a geocoding locator that understands all the names.  Here is what the feature data looks like over Europe:

Division Areas

To use my information product, if for example I want to find Madrid in Spain using the Bihari language (the popup shows the available names for Madrid), I give मैड्रिड as the address:

मैड्रिड finds Madrid

A notebook is appropriate for this information product (you can find it in the blog download), and really its only trick is using the appropriate glob path to the GeoParquet data, seen in this cell:

sql = f"""create or replace temp view division_area_view as select
                 id,
                 names.primary as primary_name,
                 class,
                 subtype,
                 region,
                 country,
                 version,
                 is_land,
                 is_territorial,
                 bbox.xmin as xmin,
                 bbox.ymin as ymin,
                 bbox.xmax as xmax,
                 bbox.ymax as ymax,
                 division_id,
                 geometry
                 from read_parquet('s3://overturemaps-us-west-2/release/{release}/theme=divisions/type=division_area/*.parquet',filename=false, hive_partitioning=1)
                 where {whereExp}
                 order by country, ST_Area(geometry) desc;"""
view = conn.sql(sql)

DuckDB can use glob paths for local or remote data and make remote queries with the S3 API, and these queries are parallelized across all files in the glob path. Note that the path includes a release identifier; this is taken from a STAC catalog at run time. This is the central theme of this post: when and how to use GeoParquet and DuckDB with data that changes at intervals. In this case we don't know which records changed, so we cannot query for them easily and instead do a bulk extract.
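The notebook in the download handles the geodatabase load (plus the related names table and the locator); purely as a hedged sketch of the pattern, with hypothetical output names, the view can be walked into a feature class with an insert cursor using WKB geometry:

import arcpy

# Hypothetical output location; the notebook writes to the project home geodatabase.
out_gdb = r"C:\Data\Demo.gdb"
fc = arcpy.management.CreateFeatureclass(
    out_gdb, "DivisionArea", "POLYGON",
    spatial_reference=arcpy.SpatialReference(4326))[0]
arcpy.management.AddField(fc, "primary_name", "TEXT", field_length=255)
arcpy.management.AddField(fc, "country", "TEXT", field_length=2)

# Pull rows from the DuckDB view defined above; ST_AsWKB lets arcpy rebuild the geometry.
rows = conn.sql(
    "select primary_name, country, ST_AsWKB(geometry) from division_area_view").fetchall()

with arcpy.da.InsertCursor(fc, ["primary_name", "country", "SHAPE@WKB"]) as cursor:
    for primary_name, country, wkb in rows:
        cursor.insertRow((primary_name, country, bytes(wkb)))

For large extracts you would batch the fetch rather than call fetchall, but the shape of the pattern is the same.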

What if we do know which records changed?

 

Frequent Append and Upsert

 

My example "busy" data is 311 case data for San Francisco.  The dataset is continuously updated with thousands of cases a day opened, edited or closed, but not pruned, and goes back to 2008.  At writing the bulk download is 8 million features, seen here at 1:10000 scale:

311 Cases in San Francisco

Many "event" datasets like this exist, so how might the dataset be efficiently delivered in a cloud native way?  The answer relies on the data having timestamp fields for when cases are opened, updated and closed, with the field updated_datetime being refreshed on any status change.  As the San Francisco open data site (and therefore the system of record behind it) has an API that supports querying then this can be done for changes based on updated_datetime.  Here is the approach taken in the tools shared in the blog download:

  • Do an initial bulk download to a baseline GeoParquet file
  • On a frequent schedule
    • Query the existing GeoParquet file(s) for the maximum updated_datetime value (sketched after this list)
    • Query the 311 system of record for records more recent than the maximum
      • This is a fast query
    • Write the query result to a new, additional GeoParquet file
      • All GeoParquet files must be at the same glob path
  • On demand, query the set of GeoParquet files to extract data of interest
    • This query makes use of a simple but powerful SQL clause, so read on...
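Here is a minimal sketch of the scheduled step, assuming the baseline and delta files share one glob path; the bucket path and the fetch_updates_since helper are hypothetical (the shared ETL tools do this work with Data Interoperability and the 311 API token):

import datetime
import duckdb

conn = duckdb.connect()
conn.sql("install httpfs")
conn.sql("load httpfs")
# Set s3_region and credentials here as needed for your bucket.

# Hypothetical glob path holding the baseline plus any earlier delta files.
pq_path = "s3://your-bucket/sf311/*.parquet"

# 1. Find the high-water mark across every file at the glob path.
max_updated = conn.sql(
    f"select max(updated_datetime) from read_parquet('{pq_path}')").fetchone()[0]

# 2. Query the 311 system of record for cases newer than max_updated (not shown here),
#    landing the result in a pandas DataFrame.
new_cases = fetch_updates_since(max_updated)  # hypothetical helper

# 3. Write the delta as a new file next to the baseline (upload to the bucket afterwards).
conn.register("new_cases_view", new_cases)
delta_name = f"sf311_{datetime.date.today():%Y%m%d}.parquet"
conn.sql(f"copy (select * from new_cases_view) to '{delta_name}' (format parquet)")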

Here is an example query where I extract into a memory feature class all 311 cases to date for 2026 within a polygon - this took 4 seconds and returned about 8,500 features.

San Francisco Query

The query tool is a script tool, and its secret sauce is the QUALIFY clause. The GeoParquet files made from daily case data will have duplicates due to the case lifecycle (opened one day, edited another, closed later), but we want only the most recent row per service_request_id value across all GeoParquet files - the QUALIFY clause does this for us when querying the parquet data.

    conn.sql(f"""create or replace temp view sf311_view as select
                service_request_id,requested_datetime,closed_date,updated_datetime,
                status_description,status_notes,agency_responsible,service_name,
                service_subtype,service_details,address,street,supervisor_district,
                neighborhoods_sffind_boundaries,police_district,source,media_url,
                bos_2012,data_as_of,data_loaded_at,ST_AsWKB(GEOM) as wkb
                from read_parquet('{pqPath}',filename=false)
                where {where}
                and ST_Intersects(ST_GeomFromText('{wkt}'), GEOM)
                qualify row_number() over (partition by service_request_id order by updated_datetime desc) = 1;""")

 

So now we have a simple way to deliver fast-changing data in a cloud native way with fast queries.

What if we have quite big data but no way to query for changes?

 

Continual Insert, Update and Delete Edits

 

Data subject to heavy branch versioned editing is high-value work for GIS, and while you can share the data state by giving access to its underlying feature services, adding a public mapping workload to the server will not be welcomed by the administrator. It turns out you can share the state of the data, with support for time travel, using a cloud-native approach. What enables this is the insert-only data model of branch versioning: the state of a feature at any moment is determined by GDB_FROM_DATE in combination with OBJECTID. The row with the latest GDB_FROM_DATE for each unique value of OBJECTID is the current state of a feature, and if you query for earlier GDB_FROM_DATE values you get time travel. The only trick is making GeoParquet files that represent edit moments, but then the QUALIFY clause comes to the rescue again to deliver the data you want.
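As a hedged sketch of that query (the ExtractCurrentParcels and ExtractEarlierParcels notebooks in the download are the real versions; the glob path, geometry column name and moment below are made up), time travel is just the 311 QUALIFY pattern with a date filter added:

import duckdb

conn = duckdb.connect()
conn.sql("install spatial")
conn.sql("load spatial")

# Hypothetical glob path of the baseline plus edit-delta GeoParquet files;
# GEOM is the geometry column name, as in the 311 example above.
pq_path = "C:/Data/Parcels/*.parquet"

# A made-up moment; drop the where clause to get the current state instead.
moment = "2026-01-15 00:00:00"

conn.sql(f"""create or replace temp view parcels_view as select
            OBJECTID, GDB_FROM_DATE, ST_AsWKB(GEOM) as wkb
            from read_parquet('{pq_path}')
            where GDB_FROM_DATE <= TIMESTAMP '{moment}'
            qualify row_number() over
                (partition by OBJECTID order by GDB_FROM_DATE desc) = 1;""")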

Here are two views of parcel data over the same extent and using the same source GeoParquet files.

Current and earlier moments

The left view is the latest moment and the right view is an earlier moment; you can see a lot of parcel subdivision has taken place. The data source retains the full data history; edits result in new GeoParquet files being added to a folder or cloud object store. Here I have a baseline 1.46GB GeoParquet file for the original state of the data, with a couple of small GeoParquet files containing edits over two weeks.

GeoParquet files containing full data history

Here is a manifest of the blog download file CloudNativeDataDistribution.zip:

  • ImportCurrentDivisionAreas.ipynb notebook
    • Downloads Overture Division Area features to your project home geodatabase
    • Creates multi-lingual names for features in a related table
    • Creates or refreshes a locator using Division Area features as reference data
  • GetBaselineSF311 spatial ETL tool
    • Extracts the full 311 dataset for San Francisco to GeoParquet
    • Requires ArcGIS Data Interoperability
    • Requires an account and app token
  • GetUpdatesSF311 spatial ETL tool
    • Extracts 311 case data more recent than any in an existing GeoParquet file
    • Creates a new GeoParquet file
    • Requires ArcGIS Data Interoperability
    • Requires an account and app token
  • Generate311Points script tool
    • Demonstrates using the GeoParquet files to make a memory feature class
  • ExtractCurrentParcels.ipynb notebook
    • Demonstrates extracting the latest data state from GeoParquet files in a branch versioned data model
  • ExtractEarlierParcels.ipynb notebook
    • Demonstrates extracting an earlier data state from GeoParquet files in a branch versioned data model

Not included, but available on request, are the tools used to build GeoParquet files for the parcel data. Remember, where my sample tools use local file storage for GeoParquet, in production you would use a cloud object store like AWS S3.
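For a private bucket, DuckDB needs credentials before s3:// glob paths like those above will resolve. A minimal sketch with placeholder values (newer DuckDB releases also offer a secrets manager as an alternative to these settings):

import duckdb

conn = duckdb.connect()
conn.sql("install httpfs")
conn.sql("load httpfs")

# Placeholder credentials; in practice these come from your credential store,
# not literals in a notebook or script tool.
conn.sql("set s3_region = 'us-west-2'")
conn.sql("set s3_access_key_id = 'YOUR_KEY_ID'")
conn.sql("set s3_secret_access_key = 'YOUR_SECRET'")

# read_parquet('s3://your-bucket/path/*.parquet') queries now work as in the examples above.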

Now, while I hope to see you at the 2026 User Conference in San Diego, if you need more encouragement then here is a sneak preview of a demo bringing Overture Building theme data into a scene. See if you can find a couple of easter egg script tools in the blog download 😉.

Overture Buildings

 
