Go Cloud Native! Overture GeoParquet, From Object Store To Feature Layer Via Online Notebook

01-19-2024 11:39 AM
by ShareUser, Esri Community Manager

Every now and then a compelling workflow is enabled by new ideas, and data distribution by cloud native formats is my example today.  Take this (very plain) map for example:

Places Of Interest in London

What you are looking at is a hosted group layer in ArcGIS Online with 364,503 place points in an arbitrary extent (the bounds of all city boroughs) in London, England, plus related tables: 364,503 address details, 364,503 common names, 23,130 business brand names, 281,408 website URLs, 487,799 source attributions, 259,100 social media link URLs, and 329,497 phone numbers.  If we explore places near the famous Harrods premises in the Kensington & Chelsea borough we can see dozens of business brands:

Godiva Chocolates

This being England, you might be interested in tea rooms around the city.  Using the place category and related tables we can see not just their locations but also their addresses, websites, social media links and phone numbers:

Tea Rooms

This place data is one of the themes from the Overture Maps Foundation and is made available under this license.  If you surf the Overture website, you'll see it is a collaboration of Amazon, Meta, Microsoft and TomTom as steering members, Esri as a general member, and many other contributor members, and is envisaged as a resource for "developers who build map services or use geospatial data".  I'm democratizing it a bit more here, giving you a pattern for consuming the data as hosted feature layers in your ArcGIS Online or ArcGIS Enterprise portals.

Let's dig into the details of how to migrate the data to feature services.

The data is made available in Amazon S3 and Azure object stores as Parquet files, with anonymous access.  I'll let you read up on the format details elsewhere, but Parquet is definitely one star of the cloud native show because it is optimized for querying by attribute column, and in the case of GeoParquet this includes a spatial column (technically multiple spatial columns, if you want to push your luck).  As GeoParquet is an emerging format it still has some things on the TODO list, like a spatial index (which would let you query with spatial operators), but Overture very thoughtfully includes a bounding box property which is simple to query by X and Y.
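
To make that concrete, here is a minimal sketch of the kind of X/Y filter the bounding box enables, expressed as a reusable where clause.  The struct field names (xmin, ymin, and so on) are an assumption on my part; they have varied between Overture releases (earlier ones used minx/miny), so check the schema of the release you are targeting.

# A hypothetical extent roughly covering central London (longitude/latitude, WGS84)
xMin, yMin, xMax, yMax = -0.25, 51.45, 0.05, 51.60

# Plain numeric comparisons on the bbox struct stand in for a spatial index;
# the field names are assumed and vary by Overture release
whereExp = "bbox.xmin >= {} and bbox.ymin >= {} and bbox.xmax <= {} and bbox.ymax <= {}".format(xMin, yMin, xMax, yMax)
print(whereExp)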

The second star of the cloud native show is DuckDB.  DuckDB enables SQL query of local or remote files like CSV, JSON and of course Parquet (and many more) as if they were local databases.  Remote file query is especially powerful when the hosting object store supports the Amazon S3 REST API; a client that speaks that API can send SQL queries over HTTP, and DuckDB is such a client.  Just as powerful is DuckDB's ability to unpack complex data types (arrays and structs) into a rich data model like I'm building here (base features and 1:M related tables), not just flat tables.  Only the query result is returned, not the remote file.
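
As a hedged, minimal sketch of that pattern (not a cell from my notebook): with the httpfs extension loaded, a single SQL statement can run against Parquet files sitting in the Overture S3 bucket, and only the aggregated answer comes back over the wire.  The release folder below is a placeholder you would replace with an actual Overture release.

import duckdb

# One-time setup for remote reads over S3/HTTP
duckdb.sql("install httpfs;")
duckdb.sql("load httpfs;")
duckdb.sql("set s3_region='us-west-2';")

# Placeholder path following the Overture layout; substitute a real release folder
placesUrl = "s3://overturemaps-us-west-2/release/<release>/theme=places/type=place/*"

# Only the single count value is returned, not gigabytes of Parquet
count = duckdb.sql("select count(*) from read_parquet('{}');".format(placesUrl)).fetchone()[0]
print('{} place rows in the release'.format(count))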

The third star of the cloud native show is ArcGIS Online hosted notebooks, which can be used to orchestrate DuckDB transactions and integrate the results into new or refreshed hosted feature layers in Online or Enterprise.  The superpower of this combination of Parquet, DuckDB and Python is that global scale data can be queried for just the subset of interest, in a desired schema, using industry standard SQL, and the whole thing can be automated.  This forever deprecates the legacy workflow of downloading superset data files to retrieve the part you want.  At the time of writing, the places data resolves to 6 files totaling 5.54GB, not something you want to haul over the wire before you start processing!  If you think about it, any file format you have to zip to move around and unzip to query (shapefile, file geodatabase) generates some friction; Parquet avoids this.

The notebook I used to import the places data is in the blog download.  Notebooks aren't easy to share graphically, but I'll make a few points.

Firstly, ArcGIS notebooks don't include the DuckDB Python API, so it needs to be installed from PyPI.  Here is my cell that does the install and import, and also loads the spatial and httpfs extensions needed for DuckDB in this workflow.

 

# Standard ArcGIS notebook imports, plus helpers for timestamps and zipping
import arcgis
import arcpy
from datetime import datetime
import os
import zipfile

# Simple UTC timestamp for progress messages
def getNow():
    return str(datetime.utcnow().replace(microsecond=0))

arcpy.env.overwriteOutput = True
sR = arcpy.SpatialReference(4326)  # WGS84, matching the Overture data

# Connect to the target portal; gisUser and gisPass are supplied elsewhere in the notebook
gis = arcgis.gis.GIS("https://www.arcgis.com", gisUser, gisPass)

# DuckDB is not in the default notebook runtime, so install it from PyPI
!pip install duckdb
import duckdb

# Load the extensions needed for this workflow: spatial types and S3/HTTP file access
duckdb.sql("install spatial;")
duckdb.sql("load spatial;")
duckdb.sql("install httpfs;")
duckdb.sql("load httpfs;")
duckdb.sql("set s3_region='us-west-2';")  # region hosting the Overture bucket

print('Processing starting at {}'.format(getNow()))

 

I did not model the full Overture places schema, only the elements that looked populated in my area of interest.

Spoiler: if you browse the places schema, note the YAML tab.  Don't be surprised in 2025 if you see YAML powering AI to let you have conversations with your data 😉
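
One note before the extraction cells: they all query a DuckDB view named places_view, filtered by the whereExp string, both of which are created earlier in the notebook (the full cells are in the blog download).  As a rough sketch under those assumptions, the view is just a window onto the remote release path, something like this, with whereExp being a bounding box expression like the one sketched earlier:

# Placeholder path; point this at the places type of the Overture release you want
placesUrl = "s3://overturemaps-us-west-2/release/<release>/theme=places/type=place/*"
duckdb.sql("create or replace view places_view as select * from read_parquet('{}');".format(placesUrl))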

Each extract from the S3 hive follows the same pattern.  Unfortunately, a bug in DuckDB prevented me from using a more efficient approach (copying directly from memory to the file geodatabase), so I fell back to ArcPy.  Here is the cell that extracts address data - note the UNNEST operator that unpacks a struct.

 

# Pull the unnested address structs for the area of interest from the remote Parquet
print('Extracting Addresses starting at {}'.format(getNow()))
sql = """with address as (select id, unnest(addresses, recursive := true) from places_view where addresses is not null
and {}) select id, freeform, locality, postcode, region, country from address;""".format(whereExp)
relation = duckdb.sql(sql)

# Build the target geodatabase table with explicit text field widths
addresses = arcpy.management.CreateTable(out_path=r"/arcgis/Places.gdb", out_name="Addresses").getOutput(0)
arcpy.management.AddField(in_table=addresses, field_name="id", field_type="TEXT", field_length=40, field_is_nullable="NULLABLE", field_is_required="REQUIRED")
arcpy.management.AddField(in_table=addresses, field_name="freeform", field_type="TEXT", field_length=1000, field_is_nullable="NULLABLE", field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses, field_name="locality", field_type="TEXT", field_length=100, field_is_nullable="NULLABLE", field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses, field_name="postcode", field_type="TEXT", field_length=1000, field_is_nullable="NULLABLE", field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses, field_name="region", field_type="TEXT", field_length=100, field_is_nullable="NULLABLE", field_is_required="NON_REQUIRED")
arcpy.management.AddField(in_table=addresses, field_name="country", field_type="TEXT", field_length=2, field_is_nullable="NULLABLE", field_is_required="NON_REQUIRED")
print('Addresses table created at {}'.format(getNow()))

# Stream the DuckDB result into the table one row at a time
with arcpy.da.InsertCursor(addresses, ["id", "freeform", "locality", "postcode", "region", "country"]) as iCursor:
    row = relation.fetchone()
    i = 1
    while row:
        if i % 10000 == 0:
            print('Inserted {} rows at {}'.format(str(i), getNow()))
        iCursor.insertRow(row)
        i += 1
        row = relation.fetchone()
print('Addresses table populated at {}'.format(getNow()))
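
The related tables built from array columns (websites, socials, phones) use the same cell pattern with a simpler UNNEST, since there is no struct to flatten.  Here is a sketch of just the query side, assuming the Overture column is named websites:

# Arrays unnest to one row per element; the websites column name is an assumption
sql = """select id, unnest(websites) as website from places_view
where websites is not null and {};""".format(whereExp)
websitesRelation = duckdb.sql(sql)
print(websitesRelation.fetchone())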

 

You'll see I set specific widths for the text fields.  Common web formats (CSV, JSON, Parquet) don't let you tune data types like this, but I like to do so.  However, if you're sharp-eyed you'll notice the postcode field has a crazy width of 1000.  The pattern I settled on for determining field widths was to keep a utility cell (in the download) in my notebook during construction and use it to explore the data for maximum field sizes.  It turns out the crazily big postcode I found looks like a data error.

Likely Data Error
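
The utility cell itself is in the download, but the idea is simply to ask DuckDB for the longest value in each candidate column, along these lines (a sketch reusing the address query above):

# Find the longest address strings in the area of interest to inform field widths
sql = """with address as (select id, unnest(addresses, recursive := true) from places_view where addresses is not null
and {}) select max(length(freeform)) as max_freeform, max(length(postcode)) as max_postcode from address;""".format(whereExp)
print(duckdb.sql(sql).fetchall())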

To go into production with this approach, first figure out a query on extent or address fields (region, country) that works for you, then plug it into the notebook after sharing it to ArcGIS Online.  You'll need to supply your own credentials of course, and change the target GIS if you're using Enterprise for the output feature service.  To initialize the feature service, run the notebook manually down to the cell that zips up the working file geodatabase, download the zip file from the notebook, share it as an item and feature service, then plug in the item ID.  Then, at each Overture release, change the theme URL to match the release and refresh your service.
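
For the refresh step, the ArcGIS API for Python can overwrite the hosted feature layer from the freshly zipped file geodatabase.  A minimal sketch, reusing the gis connection and getNow helper from the first cell; the item ID and zip path are placeholders:

from arcgis.features import FeatureLayerCollection

# Placeholder values: the item ID captured when the service was first published,
# and the zip produced by the notebook's zipping cell
featureServiceItemId = "your-feature-service-item-id"
gdbZip = r"/arcgis/Places.zip"

# Overwrite the existing hosted feature layer with the new file geodatabase
targetItem = gis.content.get(featureServiceItemId)
flc = FeatureLayerCollection.fromitem(targetItem)
flc.manager.overwrite(gdbZip)
print('Feature service overwritten at {}'.format(getNow()))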

To give you a feel for performance, it takes a few minutes for the S3 hive to be queried for each layer, but the results write at tens of thousands of features per second.  The full run takes about 50 minutes for my test extent and data model, including the overwrite of a target feature service.

Naturally, if the data schema changes, or if you want to support other Overture themes, you'll need to author notebook changes or new notebooks.  Do share!

I'm sure you'll agree that this is a powerful new paradigm that will change the industry.  I'm just following industry trends.  It will be fun to see if Parquet is what comes after the shapefile for data sharing.

The blog download has the notebook but not a file geodatabase (it's too big for the blog technology); when you generate your own services, don't forget the data license here.  Have fun!
